PyArrowOnRay: implement read_parquet??? #5523

Closed
swamidass opened this issue Jan 5, 2023 · 3 comments
Labels
External Pull requests and issues from people who do not regularly contribute to modin new feature/request 💬 Requests and pull requests for new features P1 Important tasks that we should complete soon pandas.io


@swamidass

Is your feature request related to a problem? Please describe.

Right now it appears the only way to create a dataframe with the PyArrowOnRay engine is to read it in from CSV. This is very unfortunate, and makes the module essentially unusable.

In my case, I'm using lists of integers for certain columns, which PyArrow handles beautifully as a segmented array, but which pandas encodes as objects, thereby exploding processing time and memory usage. PyArrow's format is highly desirable for this reason, as is the ability to read from Feather/Parquet (which natively support this format, unlike CSV).

I understand that PyArrowOnRay is an experimental feature, but would it be possible to implement a parquet reader? This would enable testing on a key use case as PyArrowOnRay is developed further.

Thanks!

@swamidass swamidass added new feature/request 💬 Requests and pull requests for new features Triage 🩹 Issues that need triage labels Jan 5, 2023
@vnlitvinov vnlitvinov added pandas.io P1 Important tasks that we should complete soon External Pull requests and issues from people who do not regularly contribute to modin and removed Triage 🩹 Issues that need triage labels Jan 5, 2023
@vnlitvinov
Collaborator

vnlitvinov commented Jan 5, 2023

Hi @swamidass, it should certainly be possible to do so.

I'm not sure how long it would take, so let me ping @pyrito, who did the most recent work on read_parquet in Modin 🙃

@swamidass
Author

Thanks for putting it on the list. Looking forward to being able to test it out. Moving to a pyarrow backend will be really nice, with many use cases and performance benefits.

@anmyachev
Collaborator

PyArrowOnRay is considered deprecated and was removed in #6848.
