
LazyFrame OOM on CSVs with rows that contain large amounts of data #17354

Open · 2 tasks done
TomHaii opened this issue Jul 2, 2024 · 2 comments
Labels: A-io-csv (Area: reading/writing CSV files), A-streaming (Related to the streaming engine), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-goal (Priority: aligns with long-term Polars goals), performance (Performance issues or improvements), python (Related to Python Polars)

TomHaii commented Jul 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import csv

import polars as pl

def gen_csv():
    # Write roughly 10 GB: ~one million rows, each with a ~10 KB string column.
    with open('test.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'BigRow'])
        for i in range(1, 1_000_000):
            writer.writerow([i, 'x' * (10 ** 4)])

gen_csv()

# Lazily scan the CSV, keep rows with even Ids, and stream the result to a new CSV.
lz = pl.scan_csv('test.csv')
q = lz.filter(pl.col('Id') % 2 == 0)
q.sink_csv('result.csv')

Log output

No response

Issue description

Hello, I am opening this issue after hitting OOM errors when running filters/SQL context queries on CSV datasets in which each individual row contains a large amount of data.

The reproducible example above triggers the OOM on my 6 GB RAM machine.
When each row holds less than about 1 KB of data, I can compute datasets larger than 100 GB. But as soon as rows grow larger (I can't pinpoint the exact threshold, but it happened at around 2 KB per row), an OOM starts to occur.
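For reference, a batched workaround sketch that processes the file with bounded memory (this assumes polars' read_csv_batched API; the batch size of 10,000 rows is an arbitrary choice):

import polars as pl

# Read the CSV in fixed-size batches so only one batch is resident at a time.
reader = pl.read_csv_batched('test.csv', batch_size=10_000)

include_header = True
while True:
    batches = reader.next_batches(1)
    if not batches:
        break
    filtered = batches[0].filter(pl.col('Id') % 2 == 0)
    # Append each filtered batch to the output; write the header only once.
    with open('result.csv', 'ab') as f:
        filtered.write_csv(f, include_header=include_header)
    include_header = False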

Expected behavior

LazyFrame computation should work even when each individual row contains a large amount of data.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               23.9.1
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             1.10.15
pyiceberg:            <not installed>
sqlalchemy:           1.2.19
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
TomHaii added the bug, needs triage, and python labels on Jul 2, 2024
stinodego added the performance and A-streaming labels on Jul 2, 2024
nameexhaustion added the enhancement, accepted, P-goal, and A-io-csv labels and removed the bug and needs triage labels on Jul 8, 2024
nameexhaustion (Collaborator) commented

It won't be trivial to make, but we can support this one day with a low-memory CSV reader.

nameexhaustion self-assigned this on Jul 13, 2024
ritchie46 (Member) commented

I would put this as a low prio right now @nameexhaustion. This can be added later in the streaming engine when we determine optimal chunk sizes.
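For illustration, a hypothetical sketch of what row-width-aware chunk sizing could look like (this is not polars' actual logic; the sampling approach and memory budget are assumptions):

# Hypothetical heuristic: sample a few rows to estimate the average row width,
# then size chunks to fit a fixed memory budget rather than a fixed row count.
def rows_per_chunk(sample_rows: list[bytes], memory_budget: int = 64 * 1024 * 1024) -> int:
    avg_row_bytes = max(1, sum(len(r) for r in sample_rows) // max(1, len(sample_rows)))
    return max(1, memory_budget // avg_row_bytes)

# With the ~10 KB rows from this issue, a 64 MiB budget gives roughly 6,700
# rows per chunk; with sub-1 KB rows it gives ~67,000. A fixed row-count chunk
# size tuned for narrow rows is what can blow up memory in the wide-row case.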
