Add low-level create_dataframe_from_blocks helper function #58197

Merged

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Apr 9, 2024

See my explanation at #56815

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@@ -0,0 +1,50 @@
from pandas import DataFrame
Member

Can we add a docstring (module and/or function level) to the effect of "we discourage this for everyone except pyarrow. if you think you have a use case for this, let us know at [...]"

Member Author

@phofl might also have a use case in dask (I don't know if you already have a better idea now whether that would be the case?)

Member

Yeah, we are working on changing how we shuffle data, where this would be helpful (we will get a huge number of small data frames, so the overhead is painful), but I agree that we should strengthen the wording a little bit to make it clear that end users shouldn't need this

Member Author

I already added a more generic note "For almost all use cases, you should use the standard pd.DataFrame(..) constructor instead." without naming specific libraries that use this.

What would we gain from adding "if you think you have a use case for this, let us know at"? Learning about use cases where people would use this is certainly valuable, but in the end it will be public developer API, so if we change or remove it in the future, we need to go through the normal deprecation process anyway, I think.

Member

Hopefully we'll never have to revisit this again. But if we do, there is evidence that discussions around a deprecation here would be more painful than elsewhere. It would be helpful to know ahead of such a discussion if anyone else was using it. Moreover, the "let us know" is a chance to try to talk anyone out of using this.

Member Author

Added "If you are planning to use this function, let us know by opening an issue at https://github.com/pandas-dev/pandas/issues."

@jbrockmendel
Member

Not my favorite thing in the world, but better than the status quo, so I'm on board.

@mroeschke mroeschke added the "API Design" and "Internals" (Related to non-user accessible pandas implementation) labels Apr 9, 2024
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review April 10, 2024 13:51
@jorisvandenbossche
Member Author

I added some basic roundtrip tests and tests for the corner cases that I am aware of (the numpy arrays instead of EA for datetime/timedelta, and passing 1D EAs for cases that are stored 2D internally).
(and using this in pyarrow is also passing the test suite there)
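
For readers unfamiliar with the block layout under discussion, here is a toy model in plain Python of what a roundtrip test checks. It is only a sketch under stated assumptions, not the pandas implementation or its API: `split_into_blocks` and `assemble_from_blocks` are hypothetical names, a "block" is modelled as a `(values, placement)` pair of plain lists, and columns are assumed to be non-empty. It also mimics the 1D corner case by wrapping a flat single-column block into the 2D form.

```python
# Toy model only: plain lists stand in for arrays; this is NOT pandas code.
# A block is (values, placement): `values` holds one inner list per column,
# `placement` lists the positions those columns occupy in the final table.

def split_into_blocks(columns):
    """Group non-empty columns by element type into (values, placement) pairs."""
    grouped = {}
    for position, column in enumerate(columns):
        kind = type(column[0])  # assumes every column is non-empty and homogeneous
        values, placement = grouped.setdefault(kind, ([], []))
        values.append(list(column))
        placement.append(position)
    return list(grouped.values())


def assemble_from_blocks(blocks, n_columns):
    """Rebuild the list of columns from (values, placement) pairs."""
    columns = [None] * n_columns
    for values, placement in blocks:
        # Corner case: a single-column block given as a flat (1D) list is
        # wrapped so that it matches the 2D form used for all other blocks.
        if values and not isinstance(values[0], list):
            values = [values]
        if len(values) != len(placement):
            raise ValueError("block shape does not match its placement")
        for column, position in zip(values, placement):
            columns[position] = list(column)
    if any(column is None for column in columns):
        raise ValueError("some column positions are not covered by a block")
    return columns


columns = [[1, 2, 3], [0.5, 1.5, 2.5], [4, 5, 6]]
blocks = split_into_blocks(columns)  # the two int columns share one block
roundtripped = assemble_from_blocks(blocks, n_columns=len(columns))
```

In this toy, the roundtrip holds when `roundtripped == columns`, and a flat block such as `([7, 8, 9], [0])` is accepted as a one-column table.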

@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Apr 11, 2024
@jorisvandenbossche
Member Author

@jbrockmendel @phofl could I get a more in-depth review (if there are remaining comments)? I would like to get the change that uses this included in pyarrow 16, so we don't have to worry about deprecation warnings when merging #57754 for 3.0 (for the case of using the latest released version of both libraries), but for that I need to get this merged in the coming days.

@mroeschke mroeschke mentioned this pull request Apr 12, 2024
@jorisvandenbossche
Member Author

Small reminder here

@phofl phofl merged commit ae246a6 into pandas-dev:main Apr 15, 2024
46 checks passed
@phofl
Member

phofl commented Apr 15, 2024

thx @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the internals-dataframe-from-blocks branch April 15, 2024 19:21
@jbrockmendel jbrockmendel mentioned this pull request Apr 24, 2024
5 tasks
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
4 participants