Writing partitioned Parquet files to S3 has weird memory consumption #14769
Labels
bug (Something isn't working)
needs triage (Awaiting prioritization by a maintainer)
python (Related to Python Polars)
Reproducible example
When I read batched csv and write to local file system, like this
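The original code block was not preserved in this copy of the issue. A minimal sketch of what the local-write variant could look like, assuming `pl.read_csv_batched` for the batched read and `write_parquet` with pyarrow partitioning; the batch size, output path, and partition column ("locality") are placeholders, not the values from the original report:

```python
import polars as pl

# Read the CSV in batches instead of loading it all at once.
reader = pl.read_csv_batched("truncated.csv", batch_size=100_000)

while (batches := reader.next_batches(10)) is not None:
    df = pl.concat(batches)
    # Write each chunk as a hive-partitioned Parquet dataset on the local file system.
    # "locality" is a placeholder partition column.
    df.write_parquet(
        "out/dataset",
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["locality"]},
    )
```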
I get this memory usage
But if I do exactly the same but write to S3
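Again only a sketch, identical except for the destination path; the bucket name is a placeholder, and the assumption is that pyarrow resolves the s3:// URI through its own S3 filesystem:

```python
import polars as pl

reader = pl.read_csv_batched("truncated.csv", batch_size=100_000)

while (batches := reader.next_batches(10)) is not None:
    df = pl.concat(batches)
    # Same partitioned write, but targeting an S3 prefix instead of a local directory.
    df.write_parquet(
        "s3://my-bucket/dataset",  # placeholder bucket and prefix
        use_pyarrow=True,
        pyarrow_options={"partition_cols": ["locality"]},
    )
```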
I get:
So the memory usage is over five times higher when writing to S3 than when writing to the local file system.
The graphs have been produced with https://github.com/pythonprofilers/memory_profiler, and truncated.csv is the first 2,000,000 lines of https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset?resource=download. Am I doing something obviously wrong here, or does anyone have an idea what is happening?
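For reference, memory-over-time graphs like those mentioned above are typically produced with memory_profiler's mprof commands; this assumes the reproduction lives in a script named repro.py (a placeholder name):

```shell
pip install memory_profiler
mprof run repro.py   # records the process's memory usage over time
mprof plot           # renders the recorded profile as a graph
```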
Log output
I didn't get anything on stderr.
Issue description
Memory usage when writing partitioned Parquet files to S3 is a lot larger than when writing the same data to the local file system.
Expected behavior
I expect the memory usage to be the same in both cases
Installed versions