read_excel with engine="calamine" infer_schema_length=0 returns an empty DataFrame #16408

Closed
2 tasks done
niekrongen opened this issue May 22, 2024 · 1 comment · Fixed by #16840
Labels
A-io-excel Area: reading/writing Excel files bug Something isn't working python Related to Python Polars

Comments

@niekrongen

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

[file1.xlsx](https://github.com/pola-rs/polars/files/15405767/file1.xlsx)

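import polars as pl

# calamine with schema inference disabled: returns an empty DataFrame (the bug)
print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="calamine",
        infer_schema_length=0,
    )
)

# calamine with a single inference row: returns all 13 rows
print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="calamine",
        infer_schema_length=1,
    )
)

# xlsx2csv with schema inference disabled: returns all 13 rows as strings
print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="xlsx2csv",
        infer_schema_length=0,
    )
)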

Log output

shape: (0, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ null    ┆ null    ┆ null    ┆ null    │
╞═════════╪═════════╪═════════╪═════════╡
└─────────┴─────────┴─────────┴─────────┘
shape: (13, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╡
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ …       ┆ …       ┆ …       ┆ …       │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
└─────────┴─────────┴─────────┴─────────┘
shape: (13, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╡
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ …       ┆ …       ┆ …       ┆ …       │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
└─────────┴─────────┴─────────┴─────────┘

Issue description

infer_schema_length is supported for the calamine engine; however, setting it to 0 returns an empty DataFrame (zero rows, every column typed null) instead of the expected result, in which all rows are returned with every column typed as string.
With the xlsx2csv engine, infer_schema_length=0 does return every column as string.

Expected behavior

With infer_schema_length=0, all columns should default to the string datatype, as they already do with xlsx2csv.

Installed versions

--------Version info---------
Polars:               0.20.27
Index type:           UInt32
Platform:             Windows-10-10.0.22631-SP0
Python:               3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           <not installed>
@niekrongen
Author

While this behavior is inconsistent, it stems from the way fastexcel handles its schema_sample_rows parameter.

A more detailed description of this issue has been submitted to the fastexcel GitHub repository:
ToucanToco/fastexcel#236

For now, a workaround is to pass dtypes in read_options, with every column set to dtype "string", as shown below:
pl.read_excel(
    r"<path_to_file1>\file1.xlsx",
    engine="calamine",
    # 16384 is the maximum number of columns in an .xlsx worksheet
    read_options={"dtypes": {i: "string" for i in range(16384)}},
    infer_schema_length=0,
)
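The workaround above can be wrapped in a small helper so the dtypes mapping does not have to be repeated at every call site. This is only a sketch: all_string_dtypes, read_excel_as_strings, and MAX_XLSX_COLUMNS are hypothetical names introduced here, not part of Polars or fastexcel, and the read_excel call simply mirrors the arguments from the workaround as reported against Polars 0.20.27 and fastexcel 0.10.4. Behavior on other versions has not been checked here.

```python
# Hypothetical constant: an .xlsx worksheet has at most 16,384 columns (A..XFD),
# which is where the 16384 in the workaround comes from.
MAX_XLSX_COLUMNS = 16_384


def all_string_dtypes(n_columns: int = MAX_XLSX_COLUMNS) -> dict[int, str]:
    """Map every column index in range(n_columns) to the "string" dtype."""
    if n_columns < 0:
        raise ValueError("n_columns must be non-negative")
    return {index: "string" for index in range(n_columns)}


def read_excel_as_strings(path: str, n_columns: int = MAX_XLSX_COLUMNS):
    """Hypothetical wrapper that applies the workaround from this comment."""
    # Imported lazily so all_string_dtypes stays usable without polars installed.
    import polars as pl

    return pl.read_excel(
        path,
        engine="calamine",
        # Same arguments as the workaround: force every column to "string".
        read_options={"dtypes": all_string_dtypes(n_columns)},
        infer_schema_length=0,
    )
```

Calling read_excel_as_strings(r"<path_to_file1>\file1.xlsx") is then equivalent to the explicit call above. If the number of columns in the sheet is known, passing it as n_columns keeps the mapping small.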
