from_arrow is not zero-cost #17409

Closed
2 tasks done
useredsa opened this issue Jul 3, 2024 · 2 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Comments

@useredsa

useredsa commented Jul 3, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

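import resource

import polars as pl
import pyarrow as pa

# query_online(...) stands in for the cloud fetch; it returns a ~4 GB Arrow table.
table: pa.Table = query_online(...)
# ru_maxrss is the peak resident set size, reported in kilobytes on Linux.
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6, 'GB')  # ~6 GB

df = pl.from_arrow(table, rechunk=False)  # ~4 GB of data
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6, 'GB')  # ~8 GB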

Log output

No response

Issue description

I have an Arrow table fetched from a cloud service. The table is 4 GB. After the download, the max resident memory usage of the process is 6 GB (the process needs some extra memory on top of the table). When I then convert the table with from_arrow, the max resident memory usage rises to 8 GB, which implies that the data was copied.

The data types in the df are 64-bit floats and utf8 strings. It has 80M entries and 7 columns.
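A note on the measurement itself: `ru_maxrss` is a high-water mark, so it only ever goes up and cannot tell a temporary allocation during the conversion apart from memory that is still held afterwards. The helpers below are a minimal sketch for reading both the peak and the current resident set size. The function names are made up for illustration, and the `/proc/self/status` lookup assumes Linux.

```python
import resource
import sys


def peak_rss_gb() -> float:
    """Peak resident set size of this process in GB (a high-water mark)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1e9


def current_rss_gb():
    """Current resident set size in GB, or None where /proc is unavailable."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    # The VmRSS value is given in kB.
                    return int(line.split()[1]) * 1024 / 1e9
    except OSError:
        return None
    return None


print("peak:", peak_rss_gb(), "GB")
print("current:", current_rss_gb(), "GB")
```

Printing both values before and after the `from_arrow` call would show whether the extra memory is still resident once the conversion has finished.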

Expected behavior

I expect the conversion to be zero-copy.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.5.1-41-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@useredsa useredsa added the bug, needs triage, and python labels Jul 3, 2024
@stinodego
Member

stinodego commented Jul 3, 2024

Converting from Arrow is not always zero copy. We have a different string representation than what most existing Arrow implementations have. So the behavior here is expected.
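To illustrate why a different string representation forces new allocations, here is a rough sketch in plain Python. It assumes that the representation meant here is Arrow's "string view" layout, where every value gets a fixed 16-byte entry, while the incoming data uses the classic offsets-plus-data layout. That is an assumption about the comment above, not something it states, and `utf8_to_views` is a hypothetical helper, not a Polars or pyarrow function.

```python
import struct


def utf8_to_views(offsets, data):
    """Build 16-byte view entries from an offsets + data string layout.

    Illustration only: a real implementation works on Arrow buffers.
    """
    views = bytearray()
    for i in range(len(offsets) - 1):
        start, end = offsets[i], offsets[i + 1]
        length = end - start
        if length <= 12:
            # Short strings are stored inline, zero-padded to 12 bytes.
            views += struct.pack("<i12s", length, data[start:end])
        else:
            # Long strings keep a 4-byte prefix, a buffer index, and an
            # offset into the existing data buffer.
            views += struct.pack("<i4sii", length, data[start:start + 4], 0, start)
    return bytes(views)


data = b"a" + b"polars" + b"a much longer string value"
offsets = [0, 1, 7, 33]
views = utf8_to_views(offsets, data)
print(len(views) // (len(offsets) - 1), "bytes per value")
```

Even if the original data buffer could be reused for long strings, the view entries themselves have to be built for every value, so under this assumption the conversion of a string column cannot be free.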

@stinodego stinodego closed this as not planned Jul 3, 2024
@useredsa
Author

useredsa commented Jul 4, 2024

Hi, @stinodego,

I still have the following questions:

  1. If that is the case, shouldn't the documentation say so explicitly? Strings are a pretty common type, and I think most people would assume that the conversion is zero-copy.
  2. This example implies that the whole dataframe is copied, because the memory required is double the dataframe size. If the cause is the string representation, shouldn't only the string columns be copied?
  3. Is there anything we can do to circumvent this, such as using a certain data type in pyarrow?
  4. Will something similar happen with categories? Or is converting to categories first a good alternative if the string columns have a small number of distinct values?

Thanks in advance,
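For question 2, a back-of-the-envelope estimate may help. The sketch below assumes that each string value needs a new fixed-size 16-byte entry after conversion (an assumption about the internal layout, not a confirmed detail) and that float columns are passed through untouched. The number of string columns is not given in the issue, so the loop tries a few hypothetical counts.

```python
ROWS = 80_000_000   # from the issue description
FLOAT64_BYTES = 8   # size of one 64-bit float
VIEW_BYTES = 16     # assumed: one fixed-size entry per string value


def float_column_bytes(rows: int) -> int:
    """Size of the values buffer of one 64-bit float column."""
    return rows * FLOAT64_BYTES


def view_buffer_bytes(rows: int, string_columns: int) -> int:
    """Extra bytes needed for per-value entries across the string columns."""
    return rows * VIEW_BYTES * string_columns


print("one float column:", float_column_bytes(ROWS) / 1e9, "GB")
for n in range(1, 4):  # hypothetical string-column counts
    print(n, "string column(s):", view_buffer_bytes(ROWS, n) / 1e9, "GB extra")
```

Under these assumptions, a copy limited to the string columns could still add more than a gigabyte per column, so an increase of this size does not by itself show that the float columns were copied too.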
