Using scan_csv with with_column_names argument disables schema validation #17374

Open
2 tasks done
hotaru355 opened this issue Jul 2, 2024 · 0 comments
Labels
A-io-csv (Area: reading/writing CSV files), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Given the following CSV file:

col1
notAnInt

This query should raise an exception, because the value notAnInt in col1 cannot be parsed as pl.Int16, but no exception is raised:

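import polars as pl

# The identity callback leaves the column names unchanged, yet passing it
# is enough for the Int16 schema check to be skipped.
pl.scan_csv(
    "test.csv",
    schema={"col1": pl.Int16},
    with_column_names=lambda names: names,
).sink_csv("out1.csv")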

Log output

RUN STREAMING PIPELINE
[csv -> parquet_sink]
STREAMING CHUNK SIZE: 50000 rows

Issue description

When scan_csv is called with the with_column_names argument, the validation implied by the schema argument is skipped, even when the callback returns the column names unchanged. Removing the with_column_names argument from the query restores validation.

Expected behavior

The query should raise the same exception that it raises when the with_column_names argument is omitted:

polars.exceptions.ComputeError: could not parse `notAnInt` as dtype `i16` at column 'col1' (column number 1)
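
To make the expected semantics concrete, here is a minimal stdlib-only sketch of the behavior I would expect. It is not how Polars implements CSV scanning; read_validated, parse_int16 and the int16_columns parameter are hypothetical names made up for this illustration. The point is only that the rename callback runs first and the dtype check still applies to the resulting column names.

```python
import csv
import io

INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


def parse_int16(raw, column):
    """Parse one CSV field as a signed 16-bit integer, or raise ValueError."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"could not parse {raw!r} as int16 in column {column!r}"
        ) from None
    if not INT16_MIN <= value <= INT16_MAX:
        raise ValueError(f"{raw!r} is out of range for int16 in column {column!r}")
    return value


def read_validated(csv_text, int16_columns, with_column_names=None):
    """Hypothetical reader: rename the header first, then validate the values.

    int16_columns is the set of (post-rename) column names that must hold
    16-bit integers; every other column is kept as a string.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    if with_column_names is not None:
        # The callback only changes names; it must not switch off validation.
        header = with_column_names(header)
    records = []
    for row in data:
        if not row:
            continue  # skip blank lines
        record = {}
        for name, raw in zip(header, row):
            record[name] = parse_int16(raw, name) if name in int16_columns else raw
        records.append(record)
    return records
```

With this sketch, read_validated("col1\nnotAnInt\n", {"col1"}, lambda names: names) raises ValueError whether or not the callback is passed, which is the behavior I would expect from scan_csv as well.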

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.9.3-76060903-generic-x86_64-with-glibc2.35
Python:               3.11.7 (main, Feb 12 2024, 10:41:42) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.7.5
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             1.10.15
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
hotaru355 added the bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) labels on Jul 2, 2024
stinodego added the P-low (Priority: low) and A-io-csv (Area: reading/writing CSV files) labels and removed the needs triage (Awaiting prioritization by a maintainer) label on Jul 3, 2024
Projects
Status: Ready
Development

No branches or pull requests

2 participants