
Reading CSV, Polars seems to ignore the provided Schema #15254

Closed
2 tasks done
djouallah opened this issue Mar 23, 2024 · 11 comments · Fixed by #16080
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@djouallah

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I am attaching a full reproducible example here:

https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb

colmn = [
    'I','UNIT','XX','VERSION','SETTLEMENTDATE','RUNNO',
    'DUID','INTERVENTION','DISPATCHMODE','AGCSTATUS','INITIALMW',
    'TOTALCLEARED','RAMPDOWNRATE','RAMPUPRATE','LOWER5MIN',
    'LOWER60SEC','LOWER6SEC','RAISE5MIN','RAISE60SEC',
    'RAISE6SEC','MARGINAL5MINVALUE','MARGINAL60SECVALUE',
    'MARGINAL6SECVALUE','MARGINALVALUE','VIOLATION5MINDEGREE'
]
raw = pl.scan_csv(
    f'{raw_landing}/csv/*.CSV',
    skip_rows=1,
    new_columns=colmn,
    has_header=False,
    infer_schema_length=0,
    truncate_ragged_lines=True,
)
transform =(
    raw
    .filter(pl.col("I")=='D')
    .filter(pl.col("UNIT")=='DUNIT')
    .filter(pl.col("VERSION")=='3')
    .drop("XX")
    .drop("I")
)

Log output

No response

Issue description

Basically, there are more than 25 columns in the data, but Polars so far only accounts for the number of columns found in row 1.

Expected behavior

Polars should read a CSV with a variable number of columns per row.

Installed versions

Python polars-0.20.16

@djouallah added the bug, needs triage, and python labels on Mar 23, 2024
@ritchie46
Member

Have you tried setting the full schema? Setting only columns will set the names of the schema, but Polars will still determine the schema itself, which in this case will likely be done based on the header.

We accept a schema argument that will help you completely overwrite the schema.

@djouallah
Author

djouallah commented Mar 23, 2024

It does not seem to be working; it still ignores the schema and still produces a header (column_1, column_2, etc.) based on the number of columns in row 1.


@djouallah changed the title from "Polars can't read all the columns, even when defined using columns" to "Reading CSV, Polars seems to ignore the provided Schema" on Mar 23, 2024
@cmdlineluser
Contributor

cmdlineluser commented Mar 23, 2024

If I understand correctly, this appears to be a minimal repro?

Data:

wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip

.read_csv works as expected.

import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False
)
# shape: (424_519, 130) # <- OK: 130 Columns
# ...

The same arguments with .scan_csv raises an exception:

# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.

With truncate_ragged_lines=True the file is read but we no longer get 130 columns.

pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True
).collect()

# shape: (424_519, 25) # <- ERROR!!! 25 Columns
# ...

(With read_csv(..., truncate_ragged_lines=True) we still get 130 Columns.)
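Conceptually, honoring a wider-than-header schema means padding short rows with nulls up to the schema width, while `truncate_ragged_lines=True` drops fields beyond it. A pure-Python sketch of that behavior (this is not Polars internals, just the stdlib `csv` module):

```python
import csv
import io

def read_ragged(text, width):
    """Pad each row with None up to `width`; drop fields beyond it
    (the equivalent of truncate_ragged_lines=True)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        record = record[:width]                               # truncate extras
        rows.append(record + [None] * (width - len(record)))  # pad with nulls
    return rows

rows = read_ragged("1,2,3\n4,5,6,7,8\n9,10,11\n", width=5)
# rows == [['1', '2', '3', None, None],
#          ['4', '5', '6', '7', '8'],
#          ['9', '10', '11', None, None]]
```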

@cmdlineluser
Contributor

I suppose the data is not actually needed.

Simpler repro:

import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)
    
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    # shape: (3, 5)
    # ┌─────┬─────┬─────┬──────┬──────┐
    # │ A   ┆ B   ┆ C   ┆ D    ┆ E    │
    # │ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
    # │ str ┆ str ┆ str ┆ str  ┆ str  │
    # ╞═════╪═════╪═════╪══════╪══════╡
    # │ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
    # │ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
    # │ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
    # └─────┴─────┴─────┴──────┴──────┘
    
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    # shape: (3, 3)
    # ┌─────┬─────┬─────┐
    # │ A   ┆ B   ┆ C   │
    # │ --- ┆ --- ┆ --- │
    # │ str ┆ str ┆ str │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # │ 4   ┆ 5   ┆ 6   │
    # │ 9   ┆ 10  ┆ 11  │
    # └─────┴─────┴─────┘

filabrazilska added a commit to filabrazilska/polars that referenced this issue Mar 26, 2024
When passed in as dtypes, the schema inference is not skipped. That has
the side effect that only the first `n` columns from the passed-in
schema are eventually used (see the `infer_file_schema_inner` method
in the `polars-io/src/csv/utils.rs` file).

With the change, `scan_csv` behaves the same as `read_csv` when used with
a schema having more columns than the file header:

```python
with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
 A,B,C
 1,2,3
 4,5,6,7,8
 9,10,11
 """.strip())
    f.seek(0)
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    print(df)
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    print(lf)
...
>>> check()
shape: (3, 5)
┌─────┬─────┬─────┬──────┬──────┐
│ A   ┆ B   ┆ C   ┆ D    ┆ E    │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ str ┆ str ┆ str  ┆ str  │
╞═════╪═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
│ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
│ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
└─────┴─────┴─────┴──────┴──────┘
shape: (3, 5)
┌─────┬─────┬─────┬──────┬──────┐
│ A   ┆ B   ┆ C   ┆ D    ┆ E    │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ str ┆ str ┆ str  ┆ str  │
╞═════╪═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
│ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
│ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
└─────┴─────┴─────┴──────┴──────┘
```
@filabrazilska
Contributor

Hi,
the difference between read_csv and scan_csv is that the former uses the passed-in schema for the schema attribute of CsvReader, whereas the latter uses it for the dtypes attribute.
When changed, the two behave the same (see my commit above). That said, I don't know if anyone depends on the original behaviour, so I'm not sure the maintainers are willing to accept this change.

@filabrazilska
Contributor

The documentation for scan_csv seems to suggest that the schema param should indeed be used for schema rather than dtypes: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html

@djouallah
Author

Any updates on this?

@cmdlineluser
Contributor

It looks like @filabrazilska did file a PR to address this #15305

But it hasn't been reviewed yet.

(PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)

filabrazilska added a commit to filabrazilska/polars that referenced this issue Apr 10, 2024
(same commit message and example as above)
filabrazilska added a commit to filabrazilska/polars that referenced this issue Apr 18, 2024
(same commit message and example as above)
@ritchie46
Member

Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.

@djouallah
Author

@ritchie46 there is a regression with the latest update

PanicException                            Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-11-3bc5859e5af2> in polars_clean_csv(x)
     33   z = transform.with_columns(pl.col("SETTLEMENTDATE").str.to_datetime())
     34   columns = list(set(transform.columns) - {'SETTLEMENTDATE','DUID','UNIT'})
---> 35   final = z.with_columns(pl.col(columns).cast(pl.Float64), YEAR=pl.col("SETTLEMENTDATE").dt.iso_year()).collect()
     36   return final.to_arrow()

/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1815         callback = _kwargs.get("post_opt_callback")
   1816
-> 1817         return wrap_df(ldf.collect(callback))
   1818
   1819     @overload

PanicException: called `Option::unwrap()` on a `None` value

@cmdlineluser
Contributor

cmdlineluser commented May 23, 2024

Thanks @djouallah - it seems that was a different issue.

If you can make minimal test cases, it makes it easier for the devs to fix.

As an example, I made a minimal repro for you in #16437

It has just been fixed and will be part of 0.20.29 which should be released soon.
