Writing partitioned Parquet files to S3 has weird memory consumption #14769

Open
gitriff opened this issue Feb 29, 2024 · 0 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

gitriff commented Feb 29, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When I read a CSV in batches and write it to the local file system, like this:

import polars as pl

reader = pl.read_csv_batched(
    "truncated.csv",
    batch_size=10000,
)

batches = reader.next_batches(1)
while batches:
    batches[0].write_parquet(
        "out",
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["country"]},
    )

    batches = reader.next_batches(1)

I get this memory usage
[memory usage graph: write_local]

But if I do exactly the same thing and write to S3 instead:

reader = pl.read_csv_batched(
    "truncated.csv",
    batch_size=10000,
)

batches = reader.next_batches(1)
while batches:
    batches[0].write_parquet(
        "s3://my-bucket/my-prefix/",
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["country"]},
    )

    batches = reader.next_batches(1)

I get:

[memory usage graph: write_s3]

So the memory usage is over five times higher when writing to S3 than when writing to the local file system.

The graphs were produced with https://github.com/pythonprofilers/memory_profiler, and truncated.csv is the first 2,000,000 lines of https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download.
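
Roughly, the profiling setup looks like this (the run_repro wrapper name and the 0.5 s sampling interval here are illustrative, not my exact script):

from memory_profiler import memory_usage
import matplotlib.pyplot as plt

def run_repro():
    # the read_csv_batched / write_parquet loop from above goes here
    ...

# Sample resident memory while the function runs; memory_profiler documents
# the (func, args, kwargs) tuple form and returns a list of MiB samples.
samples = memory_usage((run_repro, (), {}), interval=0.5)

plt.plot(samples)
plt.xlabel("sample")
plt.ylabel("memory used (MiB)")
plt.savefig("memory_profile.png")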

Am I doing something obviously wrong here, or does anyone have ideas about what is happening?
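
For context, my understanding (not verified against the Polars source) is that with use_pyarrow=True and partition_cols, each batch is converted to an Arrow table and written via pyarrow's dataset writer. A rough pyarrow-only approximation of what happens for each batch would be:

import pyarrow.parquet as pq

# Convert one Polars batch to an Arrow table and let pyarrow partition it.
# NOTE: this is my approximation for illustration, not necessarily the exact
# code path Polars takes internally.
table = batches[0].to_arrow()
pq.write_to_dataset(
    table,
    root_path="s3://my-bucket/my-prefix/",  # same destination as above
    partition_cols=["country"],
)

If the same memory growth shows up when writing to S3 directly with pyarrow like this, the overhead presumably comes from pyarrow's S3 handling rather than from Polars itself.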

Log output

I didn't get anything on stderr.

Issue description

Memory usage when writing partitioned Parquet files to S3 is a lot larger than when writing the same data to the local file system.

Expected behavior

I expect the memory usage to be the same in both cases.

Installed versions

>>> polars.show_versions()
--------Version info---------
Polars:               0.20.13
Index type:           UInt32
Platform:             macOS-14.1-arm64-arm-64bit
Python:               3.10.13 (main, Feb 19 2024, 13:05:31) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
