
Reading CSV, Polars seems to ignore the provided Schema #15254

Closed
2 tasks done
djouallah opened this issue Mar 23, 2024 · 11 comments · Fixed by #16080
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

@djouallah

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I am attaching a full reproducible example here:

https://github.com/djouallah/Light_ETL_Challenge/blob/main/Light_ETL_Challenge.ipynb

colmn = [
    'I','UNIT','XX','VERSION','SETTLEMENTDATE','RUNNO',
    'DUID','INTERVENTION','DISPATCHMODE','AGCSTATUS','INITIALMW',
    'TOTALCLEARED','RAMPDOWNRATE','RAMPUPRATE','LOWER5MIN',
    'LOWER60SEC','LOWER6SEC','RAISE5MIN','RAISE60SEC',
    'RAISE6SEC','MARGINAL5MINVALUE','MARGINAL60SECVALUE',
    'MARGINAL6SECVALUE','MARGINALVALUE','VIOLATION5MINDEGREE'
]
raw = pl.scan_csv(
    f'{raw_landing}/csv/*.CSV',
    skip_rows=1,
    new_columns=colmn,
    has_header=False,
    infer_schema_length=0,
    truncate_ragged_lines=True,
)
transform =(
    raw
    .filter(pl.col("I")=='D')
    .filter(pl.col("UNIT")=='DUNIT')
    .filter(pl.col("VERSION")=='3')
    .drop("XX")
    .drop("I")
)

Log output

No response

Issue description

Basically, there are more than 25 columns in the data, but Polars so far only accounts for the number of columns found in row 1.

Expected behavior

Polars should read a CSV with a variable number of columns per row.

Installed versions

Python polars-0.20.16

@djouallah added the bug, needs triage, and python labels on Mar 23, 2024
@ritchie46
Member

Have you tried setting the full schema? Setting only columns will set the names of the schema, but Polars will still determine the schema itself, which in this case will likely be done based on the header.

We accept a schema argument that will help you completely overwrite the schema.

@djouallah
Author

djouallah commented Mar 23, 2024

It does not seem to be working; it still ignores the schema and still produces a header (column_1, column_2, etc.) based on the number of columns in row 1.


@djouallah changed the title from "Polars can't read all the columns, even when defined using columns" to "Reading CSV, Polars seems to ignore the provided Schema" on Mar 23, 2024
@cmdlineluser
Contributor

cmdlineluser commented Mar 23, 2024

If I understand correctly, this appears to be a minimal repro?

Data:

wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip

.read_csv works as expected.

import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False
)
# shape: (424_519, 130) # <- OK: 130 Columns
# ...

The same arguments with .scan_csv raises an exception:

# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.

With truncate_ragged_lines=True the file is read but we no longer get 130 columns.

pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True
).collect()

# shape: (424_519, 25) # <- ERROR!!! 25 Columns
# ...

(With read_csv(..., truncate_ragged_lines=True) we still get 130 Columns.)
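Conceptually, honoring a wider-than-header schema means padding short rows with nulls up to the schema width, while `truncate_ragged_lines=True` drops fields beyond it. A pure-Python sketch of that behavior (this is not Polars internals, just the stdlib `csv` module):

```python
import csv
import io

def read_ragged(text, width):
    """Pad each row with None up to `width`; drop fields beyond it
    (the equivalent of truncate_ragged_lines=True)."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        record = record[:width]                               # truncate extras
        rows.append(record + [None] * (width - len(record)))  # pad with nulls
    return rows

rows = read_ragged("1,2,3\n4,5,6,7,8\n9,10,11\n", width=5)
# rows == [['1', '2', '3', None, None],
#          ['4', '5', '6', '7', '8'],
#          ['9', '10', '11', None, None]]
```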

@cmdlineluser
Contributor

I suppose the data is not actually needed.

Simpler repro:

import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)
    
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    # shape: (3, 5)
    # ┌─────┬─────┬─────┬──────┬──────┐
    # │ A   ┆ B   ┆ C   ┆ D    ┆ E    │
    # │ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
    # │ str ┆ str ┆ str ┆ str  ┆ str  │
    # ╞═════╪═════╪═════╪══════╪══════╡
    # │ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
    # │ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
    # │ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
    # └─────┴─────┴─────┴──────┴──────┘
    
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    # shape: (3, 3)
    # ┌─────┬─────┬─────┐
    # │ A   ┆ B   ┆ C   │
    # │ --- ┆ --- ┆ --- │
    # │ str ┆ str ┆ str │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # │ 4   ┆ 5   ┆ 6   │
    # │ 9   ┆ 10  ┆ 11  │
    # └─────┴─────┴─────┘

filabrazilska added a commit to filabrazilska/polars that referenced this issue Mar 26, 2024
When passed in as dtypes, the schema inference is not skipped. That has
the side effect that only the first `n` columns from the passed-in
schema are eventually used (see the `infer_file_schema_inner` method
in the `polars-io/src/csv/utils.rs` file).

With the change, `scan_csv` behaves the same as `read_csv` when used with
a schema having more columns than the file header:

```python
with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
 A,B,C
 1,2,3
 4,5,6,7,8
 9,10,11
 """.strip())
    f.seek(0)
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    print(df)
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    print(lf)
...
>>> check()
shape: (3, 5)
┌─────┬─────┬─────┬──────┬──────┐
│ A   ┆ B   ┆ C   ┆ D    ┆ E    │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ str ┆ str ┆ str  ┆ str  │
╞═════╪═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
│ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
│ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
└─────┴─────┴─────┴──────┴──────┘
shape: (3, 5)
┌─────┬─────┬─────┬──────┬──────┐
│ A   ┆ B   ┆ C   ┆ D    ┆ E    │
│ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ str ┆ str ┆ str  ┆ str  │
╞═════╪═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
│ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
│ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
└─────┴─────┴─────┴──────┴──────┘
```
@filabrazilska
Contributor

Hi,
the difference between read_csv and scan_csv is that the former uses the passed-in schema for the schema attribute of CsvReader, whereas the latter uses it for the dtypes attribute.
When changed, the two behave the same (see my commit above). That said, I don't know if anyone depends on the original behaviour, so I'm not sure the maintainers are willing to accept this change.

@filabrazilska
Contributor

The documentation for scan_csv seems to suggest that the schema param should indeed be used for schema rather than dtypes: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html

@djouallah
Author

Any updates on this?

@cmdlineluser
Contributor

It looks like @filabrazilska did file a PR to address this #15305

But it hasn't been reviewed yet.

(PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)

filabrazilska added a commit to filabrazilska/polars that referenced this issue Apr 10, 2024
(same commit message and example as above)
filabrazilska added a commit to filabrazilska/polars that referenced this issue Apr 18, 2024
(same commit message and example as above)
@ritchie46
Member

Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.

@djouallah
Author

@ritchie46 there is a regression with the latest update

PanicException                            Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-11-3bc5859e5af2> in polars_clean_csv(x)
     33   z = transform.with_columns(pl.col("SETTLEMENTDATE").str.to_datetime())
     34   columns = list(set(transform.columns) - {'SETTLEMENTDATE','DUID','UNIT'})
---> 35   final = z.with_columns(pl.col(columns).cast(pl.Float64), YEAR=pl.col("SETTLEMENTDATE").dt.iso_year()).collect()
     36   return final.to_arrow()

/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1815         callback = _kwargs.get("post_opt_callback")
   1816
-> 1817         return wrap_df(ldf.collect(callback))
   1818
   1819     @overload

PanicException: called `Option::unwrap()` on a `None` value

@cmdlineluser
Contributor

cmdlineluser commented May 23, 2024

Thanks @djouallah - it seems that was a different issue.

If you can make minimal test cases, it makes it easier for the devs to fix.

As an example, I made a minimal repro for you in #16437

It has just been fixed and will be part of 0.20.29 which should be released soon.
