Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-10: Add pyarrow as a required dependency #52711

Merged
merged 40 commits into from Jul 30, 2023

Conversation

mroeschke
Copy link
Member

I've tried to summarize the motivations and drawbacks from #52509 in this PDEP. Please let me know if there are any reasons on either side that I am missing.

Feel free to continue the discussion in #52509, but as a reminder the voting will take place here.

cc @pandas-dev/pandas-core

@mroeschke mroeschke added Dependencies Required and optional dependencies Arrow pyarrow functionality PDEP pandas enhancement proposal labels Apr 17, 2023
@mroeschke mroeschke changed the title PDEP: Add pyarrow as a required dependency PDEP-10: Add pyarrow as a required dependency Apr 17, 2023
by PyArrow including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be a compile time dependency which we are not contemplating at the current time; possibly could propose in the future

Comment on lines 16 to 17
- The minimum version of PyArrow will be bumped every major pandas release to the highest
PyArrow version that has been released for at least 2 years.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using the major.minor.patch terminology, major could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and minor is 6-9 months.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Correct

@bashtage
Copy link
Contributor

Will there also be a version cap? NumPy is extremely conservative with breaking changes. I can't recall a case where a cap would have helped avoid issues with NumPy changes, especially deprecations and removals. Is pyarrow similarly stable? Does pyarrow have an implicit or explicit deprecation policy? If not, would there need to be a cap on each release too?

Recently in a number of projects I've been downstream of SciPy which has been doing a lot of long-needed but painful clean up. This has resulted in cases where not too old releases of downstream projects are breaking against the most recently released SciPy.

Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited
by PyArrow including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this comment by Will has been addressed (unless I missed it?)

to make it easier to find: the link is here, and says:

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

This PDEP proposes that:

- PyArrow becomes a runtime dependency starting pandas 2.1
- The minimum version of PyArrow supported starting pandas 2.1 is version 7.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this version be consistent across the entire pandas API?

e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)

@attack68
Copy link
Contributor

In one of the linked issues there is the comment about pandas being backend agnostic. Is it possible that this PDEP can broach that and consider how making pyarrow a dependency defines that objective as being unacheivable or how it would fit in to such a concept. This is not clear to me, or whether that is indeed a goal.

Primarily I have used numpy for linear algebra and pandas as an extension for better indexing, data wrangling and output. the added bloat when incorporating into web apps with limited build space is concerning.

@phofl
Copy link
Member

phofl commented Apr 20, 2023

This isn't a really useful discussion to have right now (and IMO shouldn't be in scope here). Even if we get to a point where NumPy is optional (this is years away), NumPy is still a required dependency of PyArrow

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some comments, but I'd really like to understand the burden on development if we left things as they are in terms of it being an optional dependency. How much simpler does the code base become if PyArrow is required?

I'm also concerned about the lack of support on Alpine Linux for PyArrow, and maybe we should wait for that before accepting this PDEP.

web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
@rhshadrach
Copy link
Member

Is it a good comparison to say that trying to develop arrow-backed arrays without pyarrow as a dependency is like trying to develop NumPy-backed arrays/blocks without NumPy as a dependency?

@phofl
Copy link
Member

phofl commented Jul 4, 2023

date and times are 2 additional dtypes that we can infer properly. Added them to the list as well.

numpy.void is not a drop-in replacement for PyArrow structs, even if we would implement it.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jul 5, 2023

I understand, but doing it this way, we don't know how many people are neutral or happy with it. Even if there's a large volume of criticism, there could be an even larger volume that are neutral or happy with better perf that are probably not going to go out of their way to make their voices heard. For example, if a million people voice criticism, it would seem like a lot, but maybe not if there were 5 million people happy with it or neutral towards it (there's no one size fits all here), but we have no idea what this number would be. And that's the point I am trying to make, this issue on the warning wouldn't gather any data on the flip side of the issue (which I think is equally important), which imo would skew the results, or make them seem worse than they really might be.

IMHO, it's better that we get some feedback, rather than none. The wording as proposed doesn't commit us to saying we will not require pyarrow if we get negative feedback - it just says that we will get the feedback, which gives us the possibility of delaying the requirement based on that feedback.

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment about the word "horrendous"

web/pandas/pdeps/0010-required-pyarrow-dependency.md Outdated Show resolved Hide resolved
@phofl
Copy link
Member

phofl commented Jul 13, 2023

Reworded the horrendous thing. Let's start the voting now

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, let's do this. After recent discussions and clarifications, I'm sold: mainly based on superseding object dtype in string, list, and struct dtypes, which would be a real and immediate benefit to users

To emphasise: if accepted, this would not change the default for dtypes which are currently non-object numpy dtypes (that would be a separate discussion). It would only change the default for dtypes which are currently object (which I don't think I'm alone in considering an embarrassment)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jul 13, 2023

Reworded the horrendous thing. Let's start the voting now

Agreed. I'd like to propose that we use the newly proposed voting procedure for PDEP-1 on this. I've created an issue at #54106 for the core team to place their votes.

@phofl
Copy link
Member

phofl commented Jul 13, 2023

Not sure whether I mentioned this here already: Some reading material with Benchmarks for Arrow support/strings

https://medium.com/p/2891d3d96d2b

@phofl
Copy link
Member

phofl commented Jul 30, 2023

This was accepted in our vote. So updated status and will merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies PDEP pandas enhancement proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pyarrow a required dependency