
LazyFrame OOM on CSVs with rows that contain large amounts of data #17354

Open · 2 tasks done
TomHaii opened this issue Jul 2, 2024 · 2 comments
Labels: A-io-csv (Area: reading/writing CSV files), A-streaming (Related to the streaming engine), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-goal (Priority: aligns with long-term Polars goals), performance (Performance issues or improvements), python (Related to Python Polars)

TomHaii commented Jul 2, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import csv

import polars as pl

def gen_csv():
    # Write roughly 10 GB: ~one million rows, each with a ~10 KB string column.
    with open('test.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['Id', 'BigRow'])
        for i in range(1, 1_000_000):
            writer.writerow([i, 'x' * (10 ** 4)])

gen_csv()

# Lazily scan the CSV, keep rows with even Ids, and stream the result to a new CSV.
lz = pl.scan_csv('test.csv')
q = lz.filter(pl.col('Id') % 2 == 0)
q.sink_csv('result.csv')

Log output

No response

Issue description

Hello, I am opening this issue after hitting OOM errors when running filters/SQL context queries on CSV datasets in which each individual row contains a large amount of data.

The reproducible example above triggers the OOM on my 6 GB RAM machine.
When each row holds less than about 1 KB of data, I can compute datasets larger than 100 GB. But as soon as rows grow larger (I can't pinpoint the exact threshold, but it happened at around 2 KB per row), an OOM starts to occur.
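For reference, a batched workaround sketch that processes the file with bounded memory (this assumes polars' read_csv_batched API; the batch size of 10,000 rows is an arbitrary choice):

import polars as pl

# Read the CSV in fixed-size batches so only one batch is resident at a time.
reader = pl.read_csv_batched('test.csv', batch_size=10_000)

include_header = True
while True:
    batches = reader.next_batches(1)
    if not batches:
        break
    filtered = batches[0].filter(pl.col('Id') % 2 == 0)
    # Append each filtered batch to the output; write the header only once.
    with open('result.csv', 'ab') as f:
        filtered.write_csv(f, include_header=include_header)
    include_header = False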

Expected behavior

LazyFrame computation should work even when each individual row contains a large amount of data.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.14 (main, Mar 19 2024, 21:46:16) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               23.9.1
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.0
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             1.10.15
pyiceberg:            <not installed>
sqlalchemy:           1.2.19
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
TomHaii added the bug, needs triage, and python labels on Jul 2, 2024
stinodego added the performance and A-streaming labels on Jul 2, 2024
nameexhaustion added the enhancement, accepted, P-goal, and A-io-csv labels and removed the bug and needs triage labels on Jul 8, 2024
nameexhaustion (Collaborator) commented

It won't be trivial to make, but we can support this one day with a low-memory CSV reader.

nameexhaustion self-assigned this on Jul 13, 2024
ritchie46 (Member) commented

I would put this as a low prio right now @nameexhaustion. This can be added later in the streaming engine when we determine optimal chunk sizes.
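For illustration, a hypothetical sketch of what row-width-aware chunk sizing could look like (this is not polars' actual logic; the sampling approach and memory budget are assumptions):

# Hypothetical heuristic: sample a few rows to estimate the average row width,
# then size chunks to fit a fixed memory budget rather than a fixed row count.
def rows_per_chunk(sample_rows: list[bytes], memory_budget: int = 64 * 1024 * 1024) -> int:
    avg_row_bytes = max(1, sum(len(r) for r in sample_rows) // max(1, len(sample_rows)))
    return max(1, memory_budget // avg_row_bytes)

# With the ~10 KB rows from this issue, a 64 MiB budget gives roughly 6,700
# rows per chunk; with sub-1 KB rows it gives ~67,000. A fixed row-count chunk
# size tuned for narrow rows is what can blow up memory in the wide-row case.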
