
DEPR: make_block #56422

Merged: 2 commits merged into pandas-dev:main on Dec 11, 2023

Conversation

jbrockmendel (Member)

xref #40226

@@ -431,6 +431,7 @@ Other Deprecations
^^^^^^^^^^^^^^^^^^
- Changed :meth:`Timedelta.resolution_string` to return ``h``, ``min``, ``s``, ``ms``, ``us``, and ``ns`` instead of ``H``, ``T``, ``S``, ``L``, ``U``, and ``N``, for compatibility with respective deprecations in frequency aliases (:issue:`52536`)
- Deprecated :func:`pandas.api.types.is_interval` and :func:`pandas.api.types.is_period`, use ``isinstance(obj, pd.Interval)`` and ``isinstance(obj, pd.Period)`` instead (:issue:`55264`)
- Deprecated :func:`pd.core.internals.api.make_block`, use public APIs instead (:issue:`40226`)
Member

Curious, is there a public API to create a block?

jbrockmendel (Member Author)

No. The point is to wean downstream packages off of our internals.

@mroeschke mroeschke added Internals Related to non-user accessible pandas implementation Deprecate Functionality to remove in pandas labels Dec 9, 2023
@mroeschke mroeschke added this to the 2.2 milestone Dec 11, 2023
@mroeschke mroeschke merged commit b0ffccd into pandas-dev:main Dec 11, 2023
50 checks passed
@mroeschke (Member)

Thanks @jbrockmendel

@jorisvandenbossche (Member)

Sorry, I think this needs to be reverted. As I mentioned in the referenced issue #40226, pyarrow is using this. And as Brock mentioned above, there is no alternative for this. We can't deprecate this if we don't provide an alternative (and I think make_block actually already is the alternative to directly using the Block constructors, avoiding more usage of internal details).
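For context, a minimal sketch of the factory function under discussion, assuming pandas 2.x where it is importable (the concrete Block subclass returned is an internal detail):

import numpy as np
from pandas.core.internals.api import make_block

# make_block infers the appropriate Block subclass from the values, so
# downstream code never has to name (or track renames of) Block classes.
values = np.arange(12.0).reshape(3, 4)            # 3 columns x 4 rows
blk = make_block(values, placement=np.arange(3))  # placement = column positions
print(type(blk).__name__)                         # internal subclass chosen by pandas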

@jbrockmendel (Member Author)

Eventually we need to wean pyarrow off of both blocks and managers. Let's see if we can find a viable (i.e. not-too-big perf hit) alternative using public APIs.

An example to use as a benchmark:

import pandas as pd
import numpy as np
from pandas.core.internals.api import make_block
from pandas.core.internals import create_block_manager_from_blocks

ncols = 10
nrows = 10**5

arr1 = np.random.randn(nrows * ncols).reshape(nrows, ncols)  # float64 data
arr2 = arr1.astype(np.int64)                                 # same data as int64

index = pd.Index(range(nrows))
cols = pd.Index(range(ncols * 2))
axes = [cols, index]

def v1():
    # Status quo: build Blocks and a BlockManager directly from internals.
    locs1 = np.arange(arr1.shape[1])
    blk1 = make_block(arr1.T, placement=locs1)
    locs2 = np.arange(arr1.shape[1], arr1.shape[1] + arr2.shape[1])
    blk2 = make_block(arr2.T, placement=locs2)
    mgr = create_block_manager_from_blocks([blk1, blk2], axes, verify_integrity=False)
    return pd.DataFrame._from_mgr(mgr, axes=axes[::-1])

def v2():
    # Public-API alternative: one DataFrame per dtype, then a no-copy concat.
    left = pd.DataFrame(arr1, columns=cols[:ncols])
    right = pd.DataFrame(arr2, columns=cols[ncols:])
    return pd.concat([left, right], axis=1, copy=False)

def v3():
    # Semi-private alternative: column-by-column construction.
    arrs = list(arr1.T) + list(arr2.T)
    return pd.DataFrame._from_arrays(arrs, columns=cols, index=index, verify_integrity=False)

# check that these are all equivalent
df1 = v1()
df2 = v2()
df3 = v3()
assert df1.equals(df2)
assert df1.equals(df3)

%timeit v1()
23.2 µs ± 460 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit v2()
136 µs ± 1.46 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit v3()
371 µs ± 7.94 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit v4()
198 µs ± 1.71 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Notes:

  • I did this timing with the deprecation in this PR commented-out; I'd expect v1 to be slightly slower on main
  • v3 silently consolidates
  • Except for v3, these numbers are all unchanged(ish) if we scale nrows

The pd.concat version looks entirely viable to me. Yes, it is 6x slower, but since the overhead is O(1) in nrows we're just talking about ~110µs, which is not that big a deal.

A compromise would be to implement pd.DataFrame._from_2d_arrays_just_for_pyarrow, which at least would not touch internals.

@jreback (Contributor) commented Dec 11, 2023

How do these scale in terms of columns?

And what if we have multiple dtypes?

@jorisvandenbossche (Member)

The v2 with concat will need a reindex of the columns to be fully equivalent (in the real world, the columns with different dtypes are interleaved). I don't immediately know how you can currently achieve that without a copy using public APIs: adding a .reindex(columns=cols, copy=False) makes this a lot slower, because it does an actual take on the block values, and I don't know how to convince it that it can do a slice through the public reindex interface.

@jbrockmendel (Member Author)

Once CoW is enabled, the reindex is zero-copy.
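A minimal sketch of what that claim means, assuming pandas 2.x with the opt-in CoW mode (column names and data here are illustrative): with copy=False throughout, the reorder can be served by block views rather than an actual take.

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True  # opt-in in pandas 2.x

# Two dtype groups, reordered so the dtypes interleave.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "x": [1, 2], "y": [3, 4]})
result = df.reindex(columns=["a", "x", "b", "y"], copy=False)

# Under CoW the reindexed frame should share memory with the original:
print(np.shares_memory(df["a"].to_numpy(), result["a"].to_numpy()))  # expect True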

@jorisvandenbossche (Member)

No, I was already passing copy=False; it's our implementation that needs to figure out that it doesn't need to do a take.

@jbrockmendel (Member Author)

Good to know (cc @phofl, sounds like something is wrong with CoW + take). Until that is fixed, DataFrame._getitem_nocopy exists for pretty much this purpose.
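For illustration, a sketch of that helper as it existed in pandas 2.x (private API, since removed; it takes a list of column labels and returns a view where __getitem__ would copy):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
sub = df._getitem_nocopy(["a", "b"])  # private pandas 2.x API; returns a view
print(np.shares_memory(df["a"].to_numpy(), sub["a"].to_numpy()))  # expect True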

@jorisvandenbossche (Member)

_getitem_nocopy will give a fragmented DataFrame, though (at least the current implementation)

@jbrockmendel (Member Author)

Wasn't the original reason for caring about fragmentation that we used to do silent consolidation, which we no longer do?

@jorisvandenbossche (Member) commented Dec 12, 2023

We no longer do silent consolidation throughout operations, but our constructors generally still give you consolidated results (and so does the from-Arrow constructor right now), because AFAIK that is still the optimal layout in the general case.

And there is also not much point in first creating blocks if you then slice them up into many pieces.

@jbrockmendel (Member Author)

I have a hard time taking seriously a concern about fragmented dataframes being inefficient.

You don't like any of the alternatives or compromises I've suggested. Can you suggest some?

jorisvandenbossche added a commit that referenced this pull request Dec 13, 2023
@jorisvandenbossche (Member)

For me the compromise is that we expose only a small core subset of the internals for advanced users (a single entry point to create blocks, and one to create a manager from those blocks) and consider all other internals fully private. So, for example, we limit access to the actual Block classes and their constructors (i.e. I merged your PR #55139 for that, and I also made corresponding updates to pyarrow to follow it), but we keep a single factory function to create a block (i.e. revert this PR -> #56481).

That is also what you described in the top post of the referenced issue #40226 (and make_block, which you are deprecating here, was added for exactly that purpose).

@phofl (Member) commented Dec 14, 2023

> The v2 with concat will need a reindex of the columns to be fully equivalent (in the real world, the columns with different dtypes are interleaved). I don't immediately know how you can currently achieve that without a copy using public APIs [...]

This confuses me a little. I have a couple of questions here:

I am assuming that the columns within a specific dtype are already in the correct order? E.g. the first int column comes before the second int column and so on, and then the next set of columns is float dtype, again ordered correctly within the float dtype?

This should already be zero copy with CoW, because your block placements should always be basically range(0, nr_of_columns), which CoW will convert to a slice.

Example:

import numpy as np
import pandas as pd
from pandas.tests.copy_view.util import get_array  # helper from pandas' CoW test suite

df = pd.DataFrame({"a": [1, 2, 3], "d": 2, "e": 3, "b": 1.5, "c": 2.5})
result = df.reindex(columns=["a", "b", "c", "d", "e"], copy=False)

The int columns are all ordered correctly, same for the float columns; we just switch the order between the dtypes, which we can do zero-copy:

np.shares_memory(get_array(df, "a"), get_array(result, "a"))

Non-CoW will still copy.

@jorisvandenbossche (Member) commented Dec 15, 2023

OK, I see what happened! I tested this before (with a slightly updated example from Brock's post, interleaving the dtypes) and timed it with the default mode (around 10ms instead of 20µs); based on the profile, this was because of the Block.take_nd in the reindex step. Then I enabled CoW and it also gave around 10ms, but I didn't check the profile (I just assumed it had the same cause, also because I was already passing copy=False to reindex, assuming we would honor that regardless of CoW).

But redoing that example now, the reason it was also slower with CoW is that I was missing a copy=False in the DataFrame(arr) constructor (where we changed the default to copy when CoW is enabled for ndarray input), and apparently that gave more or less the same slowdown as the take in the non-CoW mode...

With that corrected, I indeed get a zero-copy construction with the correct column order. The overhead in this case is around 400µs vs 20µs for me. This overhead (concat index and reindexing) grows with the number of columns; next week I will also try to check the impact of more dtypes, to see if the overhead stays acceptable.

Adapted example:

In [1]: import pandas as pd
   ...: import numpy as np
   ...: from pandas.core.internals.api import make_block
   ...: from pandas.core.internals import create_block_manager_from_blocks
   ...: 
   ...: ncols = 10
   ...: nrows = 10**5
   ...: 
   ...: arr1 = np.random.randn(nrows*ncols).reshape(nrows, ncols)
   ...: arr2 = arr1.astype(np.int64)
   ...: 
   ...: index = pd.Index(range(nrows))
   ...: cols = pd.Index(range(ncols * 2))
   ...: axes = [cols, index]
<ipython-input-1-5c31cca232e4>:4: DeprecationWarning: create_block_manager_from_blocks is deprecated and will be removed in a future version. Use public APIs instead.
  from pandas.core.internals import create_block_manager_from_blocks

In [2]: def v1():
   ...:     locs1 = np.arange(arr1.shape[1]*2, step=2)
   ...:     blk1 = make_block(arr1.T, placement=locs1)
   ...:     locs2 = np.arange(1, arr1.shape[1]*2, step=2)
   ...:     blk2 = make_block(arr2.T, placement=locs2)
   ...:     mgr = create_block_manager_from_blocks([blk1, blk2], axes, verify_integrity=False)
   ...:     return pd.DataFrame._from_mgr(mgr, axes=axes[::-1])
   ...: 

In [3]: def v2():
   ...:     left = pd.DataFrame(arr1, columns=cols[0:ncols*2:2], copy=False)
   ...:     right = pd.DataFrame(arr2, columns=cols[1:ncols*2:2], copy=False)
   ...:     return pd.concat([left, right], axis=1, copy=False).reindex(columns=cols, copy=False)
   ...: 

In [4]: %timeit v1()
24.6 µs ± 399 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [5]: %timeit v2()
16.1 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: df1 = v1()

In [7]: df2 = v2()

In [8]: pd.testing.assert_frame_equal(df1, df2)

In [9]: pd.options.mode.copy_on_write = True

In [10]: %timeit v2()
463 µs ± 8.93 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

@jorisvandenbossche (Member)

I haven't had the time this week to test it further. But in any case, these are relevant alternatives for after 3.0, so I am planning to merge #56481 for now.

@jbrockmendel (Member Author) commented Dec 22, 2023 via email

@jorisvandenbossche (Member) commented Dec 22, 2023

Can you then explain why this deprecation needs to happen now, before 3.0? It's not deprecating anything that we want to change; it's just deprecating access to something we prefer people not to use (but that access is already private anyway, and as far as we know, pyarrow, plus mars for a different function based on a GitHub code search, are the only public users of this).

My understanding is that there is nothing time-sensitive about this deprecation (we wouldn't be able to remove it in 3.0 anyway, even if deprecated now for 2.2, as that would be way too fast).

PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

And I want to point out that it is this PR that is controversial: it was merged within two days of opening, and it is not in line with what was discussed on the issue it references (#40226). So I won't self-merge, but I would appreciate it if someone would merge the revert (#56481), so we can take the time to discuss this properly, without the pressure of it already being merged and an upcoming release.

@jbrockmendel (Member Author)

> Can you then explain why this deprecation needs to happen now, before 3.0?

Not getting it in for 2.x means waiting another year before actually enforcing it.

> PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.

That is going to be the case regardless of when the deprecation actually occurs. There is zero evidence of pyarrow being willing to change their usage without a deprecation.

Also to add to the list of alternatives above: just use pd.ArrowDtype, which makes way more sense anyway.
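A minimal sketch of that route: pyarrow's to_pandas() accepts a types_mapper, and passing pd.ArrowDtype keeps the columns Arrow-backed, so the conversion needs no pandas block internals at all.

import pyarrow as pa
import pandas as pd

table = pa.table({"ints": [1, 2, 3], "floats": [0.1, 0.2, 0.3]})
df = table.to_pandas(types_mapper=pd.ArrowDtype)  # Arrow-backed extension arrays
print(df.dtypes)  # int64[pyarrow], double[pyarrow]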

@lithomas1 mentioned this pull request Dec 27, 2023
@jorisvandenbossche (Member)

> > Can you then explain why this deprecation needs to happen now, before 3.0?
>
> Not getting it in for 2.x means waiting another year before actually enforcing it.

There is no way that this could be enforced in 3.0 anyhow, as it would break all released versions of pyarrow (a dependency we want to make required for 3.0). So a change like this, which breaks a (required) dependency, has to be spread over more than a year anyway, I think.

> > PyArrow cannot avoid this deprecation warning right now without a performance hit, so IMO it simply shouldn't be deprecated right now.
>
> That is going to be the case regardless of when the deprecation actually occurs.

No, that is not correct. The main alternative discussed above (concat + reindex) will be possible in the future (3.0, when CoW is enabled), but not right now.

> Also to add to the list of alternatives above: just use pd.ArrowDtype, which makes way more sense anyway.

I don't think that makes sense at the moment. PyArrow's conversion to pandas follows pandas' own choices in defaults, and for the rest it is not opinionated.

The fact that this conversion lives in PyArrow is (I think; this is from before my time) mostly historical and for packaging reasons: it was easier when pyarrow was young and evolving, and a pandas dependency on Arrow C++ would have been annoying given Python's state of packaging. But you could well argue that this code belongs in pandas instead, and at that point we would also just be using the APIs that pyarrow uses now.

@jbrockmendel (Member Author)

> There is no way that this could be enforced in 3.0 anyhow

The idea is to have this be a DeprecationWarning in 2.x, then a FutureWarning in 3.x, then enforced in 4.0.
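The staging matters because Python hides DeprecationWarning by default outside of __main__ and test runners, while FutureWarning is always shown to users. A generic sketch of the pattern (not pandas' actual implementation; the function name is made up):

import warnings

def soon_to_be_removed():
    # Stage 1 (2.x in the plan above): DeprecationWarning mainly reaches
    # library authors, e.g. in pyarrow's test suite, not end users.
    warnings.warn(
        "soon_to_be_removed is deprecated; use public APIs instead",
        DeprecationWarning,  # stage 2 (3.x) would switch this to FutureWarning
        stacklevel=2,
    )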

@rhshadrach (Member)

Can this be a DeprecationWarning in e.g. 3.0, a FutureWarning in 3.1, and then enforced in 4.0?

@jorisvandenbossche (Member)

See also my comment at #56481 (comment) about the timeline.

It's difficult to talk exactly about version numbers, because we don't know the exact timeline for those releases. If pandas 4.0 were to happen a year after 3.0 (like 3.0 is now planned a year after 2.0), then I think 4.0 would be too early for enforcing it. But maybe 4.0 happens half a year or a year later, and then that would be fine.

I also don't think the timing of enforcement is that important for pandas. It's not like a user-facing API where we want to change the behaviour, and want to get the better behaviour out as fast as reasonably possible.

@jorisvandenbossche (Member)

But to answer your actual question: even if we could already enforce it in 4.0, I think it is indeed perfectly fine to only introduce the DeprecationWarning in 3.0 (instead of now in 2.2) and change from DeprecationWarning to FutureWarning somewhere between 3.0 and 4.0.

lithomas1 pushed a commit that referenced this pull request Jan 10, 2024
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 10, 2024
lithomas1 pushed a commit that referenced this pull request Jan 11, 2024: Backport PR #56481: Revert "DEPR: make_block (#56422)" (#56814)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024