Writing partitioned Parquet files to S3 has weird memory consumption #14769

Open
gitriff opened this issue Feb 29, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

gitriff commented Feb 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When I read a CSV in batches and write it to the local file system, like this:

import polars as pl

reader = pl.read_csv_batched(
    "truncated.csv",
    batch_size=10000,
)

batches = reader.next_batches(1)
while batches:
    batches[0].write_parquet(
        "out",
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["country"]},
    )

    batches = reader.next_batches(1)

I get this memory usage
[memory usage graph: write_local]

But if I do exactly the same thing and write to S3 instead:

reader = pl.read_csv_batched(
    "truncated.csv",
    batch_size=10000,
)

batches = reader.next_batches(1)
while batches:
    batches[0].write_parquet(
        "s3://my-bucket/my-prefix/",
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["country"]},
    )

    batches = reader.next_batches(1)

I get:

[memory usage graph: write_s3]

So the memory usage is over five times higher when writing to S3 than when writing to the local file system.

The graphs were produced with https://github.com/pythonprofilers/memory_profiler, and truncated.csv is the first 2,000,000 lines of https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download.
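
Roughly, the profiling setup looks like this (the run_repro wrapper name and the 0.5 s sampling interval here are illustrative, not my exact script):

from memory_profiler import memory_usage
import matplotlib.pyplot as plt

def run_repro():
    # the read_csv_batched / write_parquet loop from above goes here
    ...

# Sample resident memory while the function runs; memory_profiler documents
# the (func, args, kwargs) tuple form and returns a list of MiB samples.
samples = memory_usage((run_repro, (), {}), interval=0.5)

plt.plot(samples)
plt.xlabel("sample")
plt.ylabel("memory used (MiB)")
plt.savefig("memory_profile.png")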

Am I doing something obviously wrong here, or does anyone have ideas about what is happening?
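
For context, my understanding (not verified against the Polars source) is that with use_pyarrow=True and partition_cols, each batch is converted to an Arrow table and written via pyarrow's dataset writer. A rough pyarrow-only approximation of what happens for each batch would be:

import pyarrow.parquet as pq

# Convert one Polars batch to an Arrow table and let pyarrow partition it.
# NOTE: this is my approximation for illustration, not necessarily the exact
# code path Polars takes internally.
table = batches[0].to_arrow()
pq.write_to_dataset(
    table,
    root_path="s3://my-bucket/my-prefix/",  # same destination as above
    partition_cols=["country"],
)

If the same memory growth shows up when writing to S3 directly with pyarrow like this, the overhead presumably comes from pyarrow's S3 handling rather than from Polars itself.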

Log output

I didn't get anything on stderr.

Issue description

Memory usage when writing partitioned Parquet files to S3 is a lot larger than when writing the same data to the local file system.

Expected behavior

I expect the memory usage to be the same in both cases.

Installed versions

>>> polars.show_versions()
--------Version info---------
Polars:               0.20.13
Index type:           UInt32
Platform:             macOS-14.1-arm64-arm-64bit
Python:               3.10.13 (main, Feb 19 2024, 13:05:31) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
