PyArrowOnRay: implement read_parquet??? #5523

swamidass · 2023-01-05T07:19:40Z

Is your feature request related to a problem? Please describe.

Right now it appears the only way to create a dataframe with the PyArrowOnRay engine is to read it in from CSV. This is very unfortunate, and makes the module essentially unusable.

In my case, I'm using lists of integers for certain columns, which PyArrow handles beautifully as a segmented array, but pandas encodes as objects, thereby exploding processing time and memory usage. PyArrow's format is highly desirable for this reason, as is the ability to read from Feather/Parquet (which natively support this format, unlike CSV).

I understand that PyArrowOnRay is an experimental feature, but would it be possible to implement a parquet reader? This would enable testing on a key use case as PyArrowOnRay is developed further.

Thanks!

vnlitvinov · 2023-01-05T16:54:56Z

Hi @swamidass, it should certainly be possible to do so.

I'm not sure how long it would take, so let me ping @pyrito, who did the most recent work on read_parquet in Modin 🙃

swamidass · 2023-01-13T21:34:07Z

Thanks for putting it on the list. Looking forward to being able to test it out. Moving to a pyarrow backend will be really nice, with many use cases and performance benefits.

anmyachev · 2024-01-25T18:59:40Z

PyArrowOnRay is considered deprecated and has been removed in #6848.

swamidass added the new feature/request 💬 (Requests and pull requests for new features) and Triage 🩹 (Issues that need triage) labels on Jan 5, 2023

vnlitvinov added the pandas.io, P1 (Important tasks that we should complete soon), and External (Pull requests and issues from people who do not regularly contribute to modin) labels and removed the Triage 🩹 (Issues that need triage) label on Jan 5, 2023

anmyachev closed this as completed Jan 25, 2024
