read_excel with engine="calamine" infer_schema_length=0 returns an empty DataFrame #16408

niekrongen · 2024-05-22T15:54:21Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

[file1.xlsx](https://github.com/pola-rs/polars/files/15405767/file1.xlsx)

import polars as pl

print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="calamine",
        infer_schema_length=0,
    )
)

print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="calamine",
        infer_schema_length=1,
    )
)

print(
    pl.read_excel(
        r"<path_to_file1>\file1.xlsx",
        engine="xlsx2csv",
        infer_schema_length=0,
    )
)

Log output

shape: (0, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ null    ┆ null    ┆ null    ┆ null    │
╞═════════╪═════════╪═════════╪═════════╡
└─────────┴─────────┴─────────┴─────────┘
shape: (13, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╡
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ …       ┆ …       ┆ …       ┆ …       │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
└─────────┴─────────┴─────────┴─────────┘
shape: (13, 4)
┌─────────┬─────────┬─────────┬─────────┐
│ Column1 ┆ Column2 ┆ Column3 ┆ Column4 │
│ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str     ┆ str     ┆ str     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╡
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ …       ┆ …       ┆ …       ┆ …       │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
│ value   ┆ value   ┆ value   ┆ value   │
└─────────┴─────────┴─────────┴─────────┘

Issue description

infer_schema_length is supported for the calamine engine. However, setting it to 0 returns an empty DataFrame instead of the expected result, in which every column is read as a string.
With the xlsx2csv engine, infer_schema_length=0 reads every column as a string.

Expected behavior

All columns should default to the string datatype, as they do with xlsx2csv.

Installed versions

--------Version info---------
Polars:               0.20.27
Index type:           UInt32
Platform:             Windows-10-10.0.22631-SP0
Python:               3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           <not installed>


niekrongen · 2024-05-22T17:46:48Z

While this is inconsistent, it stems from the way fastexcel handles the schema_sample_rows parameter.

A more detailed description of this issue has been submitted to the fastexcel GitHub repository:
ToucanToco/fastexcel#236

For now, a workaround is to pass dtypes in read_options, setting every column to dtype "string", as shown below:
pl.read_excel(
    r"<path_to_file1>\file1.xlsx",
    engine="calamine",
    read_options={"dtypes": {i: "string" for i in range(16384)}},
    infer_schema_length=0,
)
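
The 16384 in the workaround matches the maximum number of columns in an .xlsx sheet (A through XFD), so the mapping covers every possible column index. A minimal sketch of the same idea wrapped in helpers follows. The names XLSX_MAX_COLUMNS, all_string_dtypes and read_excel_as_strings are hypothetical and not part of Polars or fastexcel, and the read_excel call simply mirrors the one above rather than anything verified against other versions:

```python
# Upper bound on columns in an .xlsx worksheet (columns A through XFD).
XLSX_MAX_COLUMNS = 16_384


def all_string_dtypes(n_columns: int = XLSX_MAX_COLUMNS) -> dict[int, str]:
    """Map every column index in [0, n_columns) to the "string" dtype."""
    if n_columns < 0:
        raise ValueError("n_columns must be non-negative")
    return {i: "string" for i in range(n_columns)}


def read_excel_as_strings(path: str, n_columns: int = XLSX_MAX_COLUMNS):
    """Hypothetical wrapper around the workaround from this comment."""
    # Imported lazily so all_string_dtypes can be used without Polars installed.
    import polars as pl

    # Same arguments as the workaround above; only the dtypes mapping
    # is built by the helper instead of an inline comprehension.
    return pl.read_excel(
        path,
        engine="calamine",
        read_options={"dtypes": all_string_dtypes(n_columns)},
        infer_schema_length=0,
    )
```

If the number of columns in the sheet is known, passing it as n_columns keeps the mapping small, for example all_string_dtypes(4) for file1.xlsx.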

niekrongen added the bug, needs triage, and python labels on May 22, 2024

alexander-beedie self-assigned this May 22, 2024

This was referenced Jun 7, 2024

fix(python): Ensure read_excel and read_ods return identical frames across all engines when given empty spreadsheet tables #16802

Merged

fix(python): Consistent behaviour when "infer_schema_length=0" for read_excel #16840

Merged

alexander-beedie added the A-io-excel label and removed the needs triage label on Jun 9, 2024

ritchie46 closed this as completed in #16840 Jun 10, 2024
