Reading CSV, Polars seems to ignore the provided Schema #15254
Have you tried setting the full schema?
If I understand correctly, this appears to be a minimal repro?

Data:

```shell
wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip
```

```python
import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
)
# shape: (424_519, 130)  # <- OK: 130 columns

# The same arguments with pl.scan_csv raise:
# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.

pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True,
).collect()
# shape: (424_519, 25)  # <- ERROR!!! 25 columns
```
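Until the scanner respects the full schema, one possible workaround (a sketch of my own, not part of Polars — `pad_ragged` and `n_cols` are names I made up) is to pad the ragged file to a rectangular shape with the standard-library `csv` module before handing it to `scan_csv`:

```python
import csv
import io


def pad_ragged(text: str, n_cols: int) -> str:
    """Pad every CSV row with empty fields up to n_cols, so the file is rectangular."""
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(text)):
        # Rows already at or above n_cols are left unchanged.
        writer.writerow(row + [""] * (n_cols - len(row)))
    return out.getvalue()


padded = pad_ragged("1,2,3\n4,5,6,7,8\n9,10,11\n", 5)
```

The padded text can then be scanned with the full 5-column schema, and the empty fields come back as empty strings (or nulls, depending on the reader's null handling).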
I suppose the data is not actually needed. Simpler repro:

```python
import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)

    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    # shape: (3, 5)
    # ┌─────┬─────┬─────┬──────┬──────┐
    # │ A   ┆ B   ┆ C   ┆ D    ┆ E    │
    # │ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
    # │ str ┆ str ┆ str ┆ str  ┆ str  │
    # ╞═════╪═════╪═════╪══════╪══════╡
    # │ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
    # │ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
    # │ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
    # └─────┴─────┴─────┴──────┴──────┘

    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    # shape: (3, 3)  # <- ERROR: schema columns D and E are dropped
    # ┌─────┬─────┬─────┐
    # │ A   ┆ B   ┆ C   │
    # │ --- ┆ --- ┆ --- │
    # │ str ┆ str ┆ str │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # │ 4   ┆ 5   ┆ 6   │
    # │ 9   ┆ 10  ┆ 11  │
    # └─────┴─────┴─────┘
```
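The `read_csv` result above is what one would expect from `truncate_ragged_lines` combined with an explicit schema: the schema width wins, extra fields are cut, and missing fields become null. A toy sketch of that rule (`fit_to_schema` is a hypothetical helper of mine, not Polars code):

```python
def fit_to_schema(line: str, n_cols: int) -> list:
    """Truncate fields beyond n_cols; pad missing fields with None."""
    fields = line.split(",")
    return fields[:n_cols] + [None] * (n_cols - len(fields))


rows = ["1,2,3", "4,5,6,7,8", "9,10,11"]
fitted = [fit_to_schema(r, 5) for r in rows]
```

Every fitted row has exactly 5 entries, matching the 5-column schema, which is what `read_csv` produces and what `scan_csv` fails to do here.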
When passed in as dtypes, the schema inference is not skipped. That has the side effect that only the first `n` columns from the passed-in schema are eventually used (see the `infer_file_schema_inner` method in the `polars-io/src/csv/utils.rs` file). With the change, `scan_csv` behaves the same as `read_csv` when used with a schema having more columns than the file header:

```python
with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    print(df)
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    print(lf)
```

```
shape: (3, 5)
┌─────┬─────┬─────┬──────┬──────┐
│ A   ┆ B   ┆ C   ┆ D    ┆ E    │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ str ┆ str ┆ str  ┆ str  │
╞═════╪═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
│ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
│ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
└─────┴─────┴─────┴──────┴──────┘
shape: (3, 5)
(identical output for the scan_csv result)
```
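The gist of the fix, as I understand it (a sketch of the idea only, not the actual Rust code in `infer_file_schema_inner` — `resolve_schema` and `infer_from_file` are illustrative names):

```python
def resolve_schema(user_schema, infer_from_file):
    # Before the fix: inference ran anyway, and its header width clipped
    # user_schema to its first n columns.
    # After the fix: a user-supplied schema is taken as-is and inference
    # is skipped entirely.
    if user_schema is not None:
        return user_schema
    return infer_from_file()


full = dict.fromkeys("ABCDE", str)
resolved = resolve_schema(full, lambda: dict.fromkeys("ABC", str))
```

Under this rule, `resolved` keeps all five columns even though the file's header row only has three.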
Any updates on this?
It looks like @filabrazilska did file a PR to address this: #15305. But it hasn't been reviewed yet. (PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)
Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.
@ritchie46 there is a regression with the latest update:

```
-> 1817 return wrap_df(ldf.collect(callback))
PanicException: called …
```
Thanks @djouallah - it seems that was a different issue. If you can make minimal test cases, it makes it easier for the devs to fix. As an example, I made a minimal repro for you in #16437. It has just been fixed and will be part of …
Checks
Reproducible example
I am attaching a full reproducible example here: https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb
Log output
No response
Issue description
Basically, there are more than 25 columns in the file, but Polars so far seems to account only for the number of columns found in the first row.
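A quick way to see why this fails: the scanner sizes the schema from the first row, but the widest row in the file can be larger. A minimal sketch:

```python
rows = ["A,B,C", "1,2,3", "4,5,6,7,8"]

# Width the scanner commits to (from the first row) vs the width
# the data actually needs (the widest row anywhere in the file).
n_header = len(rows[0].split(","))
n_widest = max(len(r.split(",")) for r in rows)
```

Here `n_header` is 3 while `n_widest` is 5, so any schema wider than the first row gets silently clipped.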
Expected behavior
Polars should read a CSV with a variable number of columns per row, or at least respect the provided schema.
Installed versions
Python polars-0.20.16