feat(python): add "pyxlsb" engine support to read_excel (for excel binary workbook files) #11248

Merged: 4 commits merged into pola-rs:main on Sep 26, 2023

Conversation

alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Sep 22, 2023

Closes #11181 (and also closes #11184 by adding a .. versionadded tag to the docs).

  • Adds support for the pyxlsb engine so that we can also read Excel Binary Workbook files (those with an ".xlsb" extension, which are not compatible with any of the existing ".xlsx" engines). Note that this engine does not currently autodetect datetime/date columns (it reads them in as Excel's native offset-Julian float), and therefore requires the use of schema_overrides to load them correctly (I have added a note in the docstring about this).

  • Also: improves Date parsing for OpenOffice files (via read_ods).

  • Also: slightly updates show_versions() docs/example with the latest libs.
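
To illustrate what the "offset-Julian float" in the first bullet looks like when it is not overridden, here is a minimal sketch of converting an Excel serial date to a Python datetime. It assumes the workbook uses Excel's 1900 date system (epoch 1899-12-30, valid for serials from 1900-03-01 onward); the helper name `excel_serial_to_datetime` is hypothetical and this is not necessarily how the engine or Polars performs the conversion internally.

```python
from datetime import datetime, timedelta

# Assumption: 1900 date system. Whole days are counted from this epoch and
# the fractional part of the serial is the time of day.
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel serial date (days since the epoch) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_datetime(36525.5))  # 1999-12-31 12:00:00
print(excel_serial_to_datetime(45292.0))  # 2024-01-01 00:00:00
```

In practice, passing schema_overrides (as in the example in this description) avoids having to do this conversion by hand.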


Support for spreadsheet data is improving nicely; with this update read_excel can now read all of the major Excel formats (".xlsx", ".xlsm", ".xlsb"), and we can handle the OpenOffice ".ods" format via read_ods.
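
As a rough sketch of how calling code might route a file to the right reader based on the formats listed above: the lookup table and the `pick_reader` helper below are hypothetical (they are not part of Polars), and `None` simply stands for "leave the engine at its default".

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical mapping: file extension -> (polars function name, engine).
# Only ".xlsb" needs the "pyxlsb" engine added by this PR.
READERS = {
    ".xlsx": ("read_excel", None),
    ".xlsm": ("read_excel", None),
    ".xlsb": ("read_excel", "pyxlsb"),
    ".ods": ("read_ods", None),
}


def pick_reader(path: str) -> Tuple[str, Optional[str]]:
    """Return the reader function name and engine for a spreadsheet path."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported spreadsheet extension: {suffix!r}") from None


print(pick_reader("~/test.xlsb"))  # ('read_excel', 'pyxlsb')
```

The returned names could then be resolved with `getattr(pl, name)` in code that imports Polars; that step is left out here so the sketch has no third-party dependencies.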

Example

Reading ".xlsb" files:

import polars as pl

pl.read_excel(
    source="~/test.xlsb",
    sheet_name="misc_data",
    # date/datetime columns are not autodetected by this engine, so override them
    schema_overrides={"dtm": pl.Datetime, "dt": pl.Date},
)

Before

xlsx2csv.XlsxException: Sheet 'misc_data' not found

After

shape: (2, 3)
┌─────────────────────┬────────────┬──────┐
│ dtm                 ┆ dt         ┆ val  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ date       ┆ f64  │
╞═════════════════════╪════════════╪══════╡
│ 1999-12-31 10:30:45 ┆ 2024-01-01 ┆ 1.5  │
│ 2010-10-11 12:13:14 ┆ 2018-08-07 ┆ -0.5 │
└─────────────────────┴────────────┴──────┘

FYI: @SaelKimberly has been experimenting with writing a potentially superior ".xlsb" reading engine; if and when it is ready (well tested and available from PyPI ;) we can look at making it the default ".xlsb" engine instead. Looking forward to it 👍

@alexander-beedie alexander-beedie changed the title from "feat(python): adds "pyxlsb" engine support to read_excel (for reading binary workbook files)" to "feat(python): add "pyxlsb" engine support to read_excel (for reading excel binary workbook files)" Sep 22, 2023
@github-actions github-actions bot added the labels "enhancement" (New feature or an improvement of an existing feature) and "python" (Related to Python Polars) Sep 22, 2023
@alexander-beedie alexander-beedie changed the title from "feat(python): add "pyxlsb" engine support to read_excel (for reading excel binary workbook files)" to "feat(python): add "pyxlsb" engine support to read_excel (for excel binary workbook files)" Sep 22, 2023
Member

@stinodego stinodego left a comment

Great stuff as always! Some minor remarks/questions.

py-polars/tests/unit/io/test_spreadsheet.py (outdated, resolved)
py-polars/tests/unit/io/test_spreadsheet.py (outdated, resolved)
py-polars/polars/utils/show_versions.py (outdated, resolved)
Member

@stinodego stinodego left a comment

All right, good to go!

@stinodego stinodego merged commit 2fc17b4 into pola-rs:main Sep 26, 2023
15 checks passed
@alexander-beedie alexander-beedie deleted the read-binary-excel-workbooks branch September 26, 2023 08:18
romanovacca pushed a commit to romanovacca/polars that referenced this pull request Oct 1, 2023
Linked issues

  • read_excel schema_overrides missing
  • Read XLSX and XLSB files natively
2 participants