read_csv_batched not working when separator is included in the field #16953

Open
2 tasks done
MikeXydas opened this issue Jun 14, 2024 · 0 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars


Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

# error.csv
a,b
"test","test"
"test","test"
"test",",,"  # Notice here that we have 2 commas

# correct.csv
a,b
"test","test"
"test","test"
"test",","  # Here we only have 1 comma

import polars as pl

reader_error = pl.read_csv_batched("error.csv", separator=",", batch_size=1, quote_char="\"")
batch = reader_error.next_batches(2)
print(len(batch))  # Prints 1, wrong
print(batch)

reader_correct = pl.read_csv_batched("correct.csv", separator=",", batch_size=1, quote_char="\"")
batch = reader_correct.next_batches(2)
print(len(batch))  # Prints 2, correct
print(batch)

Log output

1
[shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
│ test ┆ test │
│ test ┆ ,,   │
└──────┴──────┘]
2
[shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘, shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘]

Issue description

I am trying to read a relatively large CSV (30M+ rows) that cannot fit into memory, so I am using read_csv_batched. However, I noticed that reader.next_batches(5), instead of returning the requested number of DataFrames (5 in our case), always returned a single DataFrame containing all the rows (far more than the given batch size).

The issue seems to be triggered by the , characters inside the field, but since the field is wrapped in the quote character ", they should be treated as literal text and should not affect the batch reader.
Note that this is a minimal example. In the real scenario we had batch_size = 100,000 and the whole CSV was still read into a single DataFrame of 30M rows.

(Posted in SO first: https://stackoverflow.com/questions/78616907/polars-issue-with-read-csv-batched-when-separator-is-included-in-the-field)
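As a sanity check on the claim that the quoted commas are valid CSV, here is a minimal sketch that parses the contents of error.csv with Python's standard-library csv module instead of Polars. The names ERROR_CSV and parse_rows are hypothetical helpers introduced only for this illustration, and the inline "# Notice here..." annotation from the listing is left out because it is not part of the data.

```python
import csv
import io

# Contents of error.csv from the reproducible example (annotation comment omitted).
ERROR_CSV = 'a,b\n"test","test"\n"test","test"\n"test",",,"\n'


def parse_rows(text, separator=",", quote_char='"'):
    """Parse CSV text with the standard-library reader and return every row."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote_char)
    return list(reader)


rows = parse_rows(ERROR_CSV)
header, data = rows[0], rows[1:]

print(header)    # ['a', 'b']
print(len(data)) # 3
print(data[-1])  # ['test', ',,']
```

A quote-aware parser sees three data rows of two fields each, with ",," kept as the literal value of column b. This only shows that the file itself is well formed; it says nothing about where the batching goes wrong inside Polars.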

Expected behavior

The expected behavior is the one shown in the correct.csv example, where 2 batches of size 1 are created:

[shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘, 
shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ test ┆ test │
└──────┴──────┘]
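To make the expected batching concrete, the following is a rough sketch of a quote-aware batch iterator built on the standard-library csv module rather than on read_csv_batched. It is not how Polars implements batching, and iter_batches is a hypothetical helper, not a Polars API. It might also serve as an interim workaround, assuming each batch of row dicts is then handed to pl.DataFrame(...).

```python
import csv
import io
from itertools import islice


def iter_batches(handle, batch_size, separator=",", quote_char='"'):
    """Yield lists of at most `batch_size` row dicts from an open text file."""
    reader = csv.DictReader(handle, delimiter=separator, quotechar=quote_char)
    while True:
        # islice pulls at most batch_size parsed rows, so memory use stays bounded.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Contents of error.csv from the reproducible example (annotation comment omitted).
ERROR_CSV = 'a,b\n"test","test"\n"test","test"\n"test",",,"\n'

# Equivalent of asking for 2 batches with batch_size=1.
first_two = list(islice(iter_batches(io.StringIO(ERROR_CSV), batch_size=1), 2))
print(len(first_two))  # 2
print(first_two[0])    # [{'a': 'test', 'b': 'test'}]

# Reading everything in batches of 1 gives one batch per data row.
all_batches = list(iter_batches(io.StringIO(ERROR_CSV), batch_size=1))
print(len(all_batches))  # 3
print(all_batches[-1])   # [{'a': 'test', 'b': ',,'}]
```

For a real file, the handle would come from open(path, newline=""), as the csv module documentation recommends. This will be much slower than the native Polars reader, so it is only meant to show the batch boundaries I would expect.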

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.6.0
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             2.6.0
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.27
torch:                2.2.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
