DEPR: Deprecate using `xlrd` engine for read_excel #35029

roberthdevries · 2020-06-27T14:26:43Z

closes Deprecate using xlrd engine in favor of openpyxl #28547
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This is MR #29375 but rebased to master

MarcoGorelli

Can you use the method described here https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#testing-warnings to test for warnings?
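
For readers without the pandas test utilities at hand: the linked guide covers pandas' own helper for this (`assert_produces_warning` in its testing module). The same idea can be sketched with only the standard library. Both `assert_warns_category` and `read_with_default_engine` are hypothetical names made up for this illustration, not pandas code.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def assert_warns_category(category):
    """Fail unless the wrapped block emits at least one warning of `category`."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even if an earlier filter would hide it.
        warnings.simplefilter("always")
        yield caught
    if not any(issubclass(w.category, category) for w in caught):
        raise AssertionError(f"expected a {category.__name__}, got none")


def read_with_default_engine():
    # Hypothetical stand-in for the call under test.
    warnings.warn("the default engine will change", FutureWarning, stacklevel=2)
    return "ok"


with assert_warns_category(FutureWarning):
    result = read_with_default_engine()
```

Wrapping the call in a context manager keeps the assertion next to the code that is expected to warn, which is also how the pandas helper is typically used.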

pandas/io/excel/_base.py

st-pasha · 2020-06-29T20:04:17Z

pandas/io/excel/_base.py

@@ -852,6 +853,14 @@ def __init__(self, path_or_buffer, engine=None):
                ext = os.path.splitext(str(path_or_buffer))[-1]
                if ext == ".ods":
                    engine = "odf"
+
+        if engine == "xlrd":


The warning should not be issued when the parameter engine="xlrd" is passed explicitly.

Hmm, if the engine is deprecated, I would expect that all uses should be discouraged. Explicit or implicit.

xlrd is the only thing that will read legacy .xls files unfortunately, so I don't think we need to outright remove all usage of it but want the default to switch to openpyxl

So are we already switching to openpyxl for everything other than .xls files (except of course .ods files and maybe .xlsb files)?

Yea xlsx and xlsm files (the former I would hope is what the vast majority of people read nowadays)

I have now changed the default engine to openpyxl and added a check to use xlrd for .xls files
This required quite some changes to the tests and a work-around for a rounding error in openpyxl.
See https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493
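
A minimal sketch of the extension check described in this comment, assuming a simple lookup table. The names `_ENGINE_BY_EXTENSION` and `pick_default_engine` are hypothetical, and the lower-casing of the extension is an addition for the example rather than something the diff shows.

```python
import os

# Hypothetical mapping for illustration; not the table pandas uses internally.
_ENGINE_BY_EXTENSION = {
    ".xls": "xlrd",  # legacy format that only xlrd can read
    ".ods": "odf",   # OpenDocument spreadsheet
}


def pick_default_engine(path, fallback="openpyxl"):
    """Return an engine name based on the file extension of `path`."""
    ext = os.path.splitext(str(path))[-1].lower()
    return _ENGINE_BY_EXTENSION.get(ext, fallback)
```

With this shape, `pick_default_engine("report.xls")` selects `xlrd` while `.xlsx` and `.xlsm` paths fall through to `openpyxl`.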

WillAyd · 2020-07-01T15:40:31Z

pandas/io/excel/_base.py

@@ -844,14 +844,24 @@ class ExcelFile:

    def __init__(self, path_or_buffer, engine=None):
        if engine is None:
-            engine = "xlrd"
+            engine = "openpyxl"


So this actually changes the engine; I think the first step is to provide the warning that you have below that by default in the future we will switch to using openpyxl

Making an exception for .xls files which have to use xlrd

So that means that a warning shall only be produced for .xlsx and .xlsm files that use xlrd?
And regarding the other remark about switching the engines, that was what you asked for a couple of comments back?

That's right - warn first then change over time

So the change to use the openpyxl engine as the default has to be reverted? Or is it just that the xlrd engine is going to be removed altogether in the future, and with it the support for .xls files?

@WillAyd Should I revert the change to make openpyxl the default engine?
And only warn when using xlrd in combination with .xlsx or .xlsm files?

WillAyd · 2020-07-03T22:08:42Z

Yea, we should warn first; we can actually change the default in 2.0


jreback · 2020-07-10T14:03:52Z

@roberthdevries can you update per the comments and merge master?

simonjayhawkins · 2020-07-24T11:57:15Z

@roberthdevries can you address comments?

roberthdevries · 2020-07-24T14:52:46Z

Not until I am back from vacation in three weeks.

roberthdevries · 2020-08-23T16:23:10Z

I have addressed the remaining comment from @WillAyd to only warn about the pending deprecation.

simonjayhawkins · 2020-08-24T09:40:29Z

doc/source/whatsnew/v1.2.0.rst

@@ -143,6 +143,7 @@ See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for mor
 Deprecations
 ~~~~~~~~~~~~
 - Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
+- :func:`read_excel` "xlrd" engine is deprecated for all file types that can be handled by "openpyxl" because "xlrd" is no longer maintained (:issue:`28547`).


can you reword for the benefit of end users.

i.e. the default engine for read_excel is changing in the future

maybe say something like openpyxl is the recommended engine as xlrd is no longer maintained

simonjayhawkins · 2020-08-24T09:43:36Z

pandas/io/excel/_openpyxl.py

+            try:
+                # workaround for inaccurate timestamp notation in excel
+                return datetime.fromtimestamp(round(cell.value.timestamp()))
+            except (AttributeError, OSError):
+                return cell.value


why is this changing?

This was a work-around for a bug in openpyxl (https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1493), but is only apparent when you do a round trip save to xlsx and read back xlsx using openpyxl.
As this is not tested in any unit test, this can be removed. Agreed?

If adding code we should have a test. so either need to add test or can remove.

my preference would be to have this in a separate PR, so should raise pandas issue for this if removing.

Fair enough, remove it here and make a separate PR that includes a regression test.
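
For context, the rounding in the removed snippet can be tried in isolation. This is a sketch of the same expression wrapped in a hypothetical helper, `round_to_second`; it takes the value directly instead of an openpyxl cell, so it makes no claim about the openpyxl API.

```python
from datetime import datetime


def round_to_second(value):
    """Round a datetime to the nearest whole second; pass other values through."""
    try:
        return datetime.fromtimestamp(round(value.timestamp()))
    except (AttributeError, OSError):
        # Not a datetime (no .timestamp()), or outside the platform's range.
        return value


# A timestamp that is one microsecond short of a whole second.
almost = datetime(2020, 6, 27, 14, 26, 42, 999999)
rounded = round_to_second(almost)
```

Note that this discards any genuine sub-second precision, which is one reason a regression test in a separate PR makes sense before keeping it.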

simonjayhawkins · 2020-08-26T09:14:04Z

doc/source/whatsnew/v1.2.0.rst

@@ -144,6 +144,8 @@ Deprecations
 ~~~~~~~~~~~~
 - Deprecated parameter ``inplace`` in :meth:`MultiIndex.set_codes` and :meth:`MultiIndex.set_levels` (:issue:`35626`)
 - Deprecated parameter ``dtype`` in :~meth:`Index.copy` on method all index classes. Use the :meth:`Index.astype` method instead for changing dtype(:issue:`35853`)
+- :func:`read_excel` "xlrd" engine is deprecated. The recommended engine is "openpyxl" for "xlsx" and "xlsm" files, because "xlrd" is no longer maintained (:issue:`28547`).


thanks for updating. it's the ``read_excel "xlrd" engine is deprecated`` bit that I wanted removed

IIUC xlrd is not deprecated. it's only that the default engine used will change.

Is it deprecated for use with xlsx files? IOW in the future, will only openpyxl be supported for xlsx? Sounds reasonable to me.

hmm, that's not my understanding of the discussion in #28547

from #28547 (comment)

Considering that I think we need to deprecate using xlrd in favor of openpyxl. We might not necessarily need to remove the former and it does offer some functionality the latter doesn't (namely reading .xls files) but should at the very least start moving towards the latter

Indeed, the discussion has not explicitly mentioned disallowing xlrd for formats that openpyxl supports. But if the xlrd engine is not removed, we should decide now whether we would restrict its use.

My read of this is to only keep xlrd for xls where it is required, and to deprecate where it is not. In the long run, if xlrd breaks and no one takes over its maintenance, then we will either have to vendor xlrd or remove support for xls. In either case minimizing use of xlrd seems like a good idea to me.

I suppose to me the only point of deprecating is to start a path to removal of a feature. If there is not going to be removal in the future, then why bother with deprecation?

One could always discourage xlrd + xlsx with a noisy FutureWarning telling users that xlrd is unmaintained and they should install openpyxl for reading xlsx files.

@WillAyd is very specific regarding this issue. He states that we should warn first, then change the default (and maybe even remove the xlrd engine for xlsx and xlsm files altogether).
But not at this point, only a deprecation warning is asked, to notify users that this engine is no longer the preferred engine.

Just to clarify we should only be changing the default reader to openpyxl. I think it's fine to keep xlrd around as a YMMV situation

Hmm, I just reverted the changes that made openpyxl the default. I am now very confused. I thought that this change was just about warning people about pending deprecation of the xlrd reader, not to switch them over already.

See your comments of July 1 and 3.

I see where the wording is confusing, but yes we only warn now and change in the future. We always manage user-facing changes that way

WillAyd · 2020-08-26T13:15:52Z

pandas/io/excel/_base.py

                    engine = "odf"
+
+        elif engine == "xlrd" and ext in ("xlsx", "xlsm"):


This warning should be in the if engine is None branch

Are you sure? This will mean that people also get a warning when they ask for the default (which is still xlrd), instead of when they explicitly ask for xlrd.

Yea so the point of it is that people who want to suppress the warnings will get a head start and explicitly request engine="openpyxl", which is a good thing to sniff out any bugs
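
A rough sketch of the placement being asked for here: the warning fires only when the caller relies on the default, and an explicit `engine=` stays silent. The function name `resolve_engine` and the message text are hypothetical, not the wording that was merged.

```python
import warnings


def resolve_engine(engine=None):
    """Return the engine to use, warning only when the default is relied upon."""
    if engine is None:
        warnings.warn(
            "The default engine will change from 'xlrd' to 'openpyxl' in a "
            "future version; pass engine='openpyxl' to opt in now.",
            FutureWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return "xlrd"  # default unchanged for now, per the discussion
    return engine  # explicit choice: no warning
```

Callers who pass `engine="openpyxl"` silence the warning and exercise the future default early, which is the head start described above.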

luk-f-a · 2020-08-26T15:35:49Z

just because I didn't see it mentioned above, I thought I'd mention that openpyxl is 10x slower. So it's a good thing that the default is not being changed, because otherwise a lot of users would see an impact after upgrading.

erfannariman · 2020-09-17T09:49:58Z

just because I didn't see it mentioned above, I thought I'd mention that openpyxl is 10x slower. So it's a good thing that the default is not being changed, because otherwise a lot of users would see an impact after upgrading.

Interesting, what did you base this on?

luk-f-a · 2020-09-17T09:52:56Z

Interesting, what did you base this on?

Testing based on files I'm working on. After seeing the deprecation notice, we switched our code, and found out our CI was taking one hour longer (!). We traced it back to openpyxl. Fortunately we were aware we had done that change. If pandas had changed the default silently, it would have taken us a long time to figure out what was going on.

simonjayhawkins · 2020-09-17T12:16:47Z

@roberthdevries This branch has conflicts that must be resolved. Can you also address @WillAyd's comments? #35029 (review)

github-actions · 2020-10-18T00:16:42Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

rhshadrach · 2020-11-30T02:59:11Z

@jreback @jorisvandenbossche

Changes made. I've adjusted the tests only as necessary to either (i) test for openpyxl as being the default or (ii) specified engine="xlrd" if openpyxl would fail and either it seemed the test was specifically for xlrd or there was no clear way to resolve the test otherwise.

I don't know how to write a test for the FutureWarning - in testing environment, I think openpyxl will always be installed and so it will never be raised. As such, I've simply removed the tests for the FutureWarning for now.

As mentioned above, I won't be available until 16:00 EST (21:00 UTC). Anyone is of course welcomed to push this over the finish line.

jreback · 2020-11-30T03:06:18Z

we don't have either installed in the numpy dev environment so can certainly add some basic tests that run (likely we skip everything if we don't have either installed)
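
One way to express "skip everything if neither reader is installed", sketched with the standard library only. pytest users would more likely reach for `pytest.importorskip`; the `has_module` helper and the test class below are hypothetical and do not come from the pandas test suite.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


HAS_READER = has_module("xlrd") or has_module("openpyxl")


@unittest.skipUnless(HAS_READER, "needs xlrd or openpyxl")
class TestDefaultEngine(unittest.TestCase):
    def test_placeholder(self):
        # A real test would call read_excel here; omitted in this sketch.
        self.assertTrue(HAS_READER)
```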

rhshadrach · 2020-12-01T01:31:09Z

Linux py37_minimum_versions failure is pandas/tests/resample/test_deprecated.py:98, unrelated.
macOS py37 failed to start, trying again.
All other builds passed.

rhshadrach · 2020-12-01T01:31:16Z

/azp run

azure-pipelines · 2020-12-01T01:31:26Z

Azure Pipelines successfully started running 1 pipeline(s).

rhshadrach · 2020-12-01T02:07:54Z

@jreback tests pass; two failures below on Linux py38_np_dev are unrelated and passed on the previous run.

FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...

simonjayhawkins · 2020-12-01T09:02:27Z

@jreback tests pass; two failures below on Linux py38_np_dev are unrelated and passed on the previous run.
FAILED pandas/tests/series/indexing/test_indexing.py::test_loc_setitem_2d_to_1d_raises
FAILED pandas/tests/util/test_show_versions.py::test_show_versions - Assertio...

these tests are failing on other PRs too.

jreback

just some formatting considerations. ping when pushed as ok for merge. cc @jorisvandenbossche

jreback · 2020-12-01T13:42:42Z

doc/source/whatsnew/v1.2.0.rst

+.. warning::
+
+   Previously, the default argument ``engine=None`` to ``pd.read_excel``
+   would result in using the xlrd engine in many cases. The engine xlrd is no longer


double back-tick on xlrd (alt can put a link to xlrd itself, e.g. https://xlrd.readthedocs.io/en/latest/)

jreback · 2020-12-01T13:43:13Z

doc/source/whatsnew/v1.2.0.rst

+   following logic is now used to determine the engine.
+
+   - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
+   - Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.


double backtick xlrd / odf (only put the docs link on L14)

jreback · 2020-12-01T13:43:55Z

pandas/io/excel/_base.py

    - "xlrd" supports most old/new Excel file formats.
    - "openpyxl" supports newer Excel file formats.
    - "odf" supports OpenDocument file formats (.odf, .ods, .odt).
    - "pyxlsb" supports Binary Excel files.
+
+    .. versionchanged:: 1.2.0
+        The engine xlrd is no longer maintained, and is not supported with


I think we need a blank line here for this to render (make this section the same as in the whatsnew as per formatting)

No, for versionchanged this is OK (rst .. ;-))

jreback · 2020-12-01T13:45:11Z

doc/source/whatsnew/v1.2.0.rst

+   maintained, and is not supported with python >= 3.9. When ``engine=None``, the
+   following logic is now used to determine the engine.
+
+   - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.


link for odf: https://pypi.org/project/odfpy/

jreback · 2020-12-01T13:45:37Z

doc/source/whatsnew/v1.2.0.rst

+
+   - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
+   - Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
+   - Otherwise if openpyxl is installed, then openpyxl will be used.


link: https://pypi.org/project/openpyxl/

jreback · 2020-12-01T13:46:00Z

pandas/io/excel/_base.py

+           python >= 3.9. When ``engine=None``, the following logic will be
+           used to determine the engine.
+
+           - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt),


obviously as much of the formatting you can do here as well

jorisvandenbossche

Small comment on the whatsnew, but it's perfectly fine to only address this later after the RC as well, it's not a blocker

jorisvandenbossche · 2020-12-01T14:25:44Z

pandas/io/excel/_base.py

    - "xlrd" supports most old/new Excel file formats.
    - "openpyxl" supports newer Excel file formats.
    - "odf" supports OpenDocument file formats (.odf, .ods, .odt).
    - "pyxlsb" supports Binary Excel files.
+
+    .. versionchanged:: 1.2.0
+        The engine xlrd is no longer maintained, and is not supported with


No, for versionchanged this is OK (rst .. ;-))

jorisvandenbossche · 2020-12-01T14:29:40Z

doc/source/whatsnew/v1.2.0.rst

+   - If ``path_or_buffer`` is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
+   - Otherwise if ``path_or_buffer`` is a bytes stream, the file has the extension ``.xls``, or is an xlrd Book instance, then xlrd will be used.
+   - Otherwise if openpyxl is installed, then openpyxl will be used.
+   - Otherwise xlrd will be used and a ``FutureWarning`` will be raised.


I would maybe rearrange this list: the most important piece of information we want to convey here is that for xlsx files the default changed from xlrd to openpyxl, if installed. So I would also put that on top of the list (or keep it just to this for the whatsnew, as the other items didn't change. The full list is still in the actual docs).

We don't actually look at the extension or the file format when determining the engine in various cases, so it isn't just changing for xlsx files, right? What do you think of this:

Previously, the default argument ``engine=None`` to ``pd.read_excel`` would result in using the `xlrd <https://xlrd.readthedocs.io/en/latest/>`_ engine in many cases. The engine ``xlrd`` is no longer maintained, and is not supported with python >= 3.9. If `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed, many of these cases will now default to using the ``openpyxl`` engine. See the :func:`read_excel` docs for more details.
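
The selection order listed in the whatsnew diff above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `choose_engine` is a hypothetical name, the xlrd `Book` case is left out so the example does not need xlrd installed, "bytes stream" is approximated as `bytes` or a binary buffer, and the second return value only signals that the caller should emit a `FutureWarning`.

```python
import importlib.util
import io
import os

ODF_EXTENSIONS = {".odf", ".ods", ".odt"}


def choose_engine(path_or_buffer, openpyxl_installed=None):
    """Return (engine, should_warn) following the order in the whatsnew entry."""
    if openpyxl_installed is None:
        openpyxl_installed = importlib.util.find_spec("openpyxl") is not None

    # Only paths carry an extension; buffers are treated as having none.
    is_path = isinstance(path_or_buffer, (str, os.PathLike))
    ext = os.path.splitext(str(path_or_buffer))[-1].lower() if is_path else ""

    if ext in ODF_EXTENSIONS:
        return "odf", False
    if isinstance(path_or_buffer, (bytes, io.BufferedIOBase)) or ext == ".xls":
        return "xlrd", False
    if openpyxl_installed:
        return "openpyxl", False
    return "xlrd", True  # fallback: caller should emit a FutureWarning
```

Passing `openpyxl_installed` explicitly makes both branches testable on a machine where openpyxl happens to be present.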

jreback · 2020-12-01T16:08:38Z

Small comment on the whatsnew, but it's perfectly fine to only address this later after the RC as well, it's not a blocker

ideally we do this now, because these docs are important (and small change)

jreback · 2020-12-01T23:28:04Z

thanks @rhshadrach and @roberthdevries for this, very nice!

we may need some tweaks during the rc but can be later

rhshadrach · 2020-12-01T23:35:55Z

Thanks @jreback. Happy to support if issues arise.

xlrd 1.2 fails if defusedxml (needed for odf) is installed Bug: pandas-dev/pandas#35029 Bug-Debian: https://bugs.debian.org/976620 Origin: upstream b3a3932af6aafaa2fd41f17e9b7995643e5f92eb Author: Robert de Vries, Rebecca N. Palmer <rebecca_palmer@zoho.com> Forwarded: not-needed Gbp-Pq: Name xlrd_976620.patch

MarcoGorelli requested changes Jun 28, 2020

WillAyd added the IO Excel read_excel, to_excel label Jun 29, 2020

st-pasha reviewed Jun 29, 2020

roberthdevries force-pushed the fix-28547-deprecate-xlrd branch 2 times, most recently from 9e6474d to 40fbf53 Compare July 1, 2020 07:35

WillAyd requested changes Jul 1, 2020

roberthdevries force-pushed the fix-28547-deprecate-xlrd branch 2 times, most recently from 8ed0652 to fad02a5 Compare July 3, 2020 20:23

simonjayhawkins added the Deprecate Functionality to remove in pandas label Jul 24, 2020

roberthdevries force-pushed the fix-28547-deprecate-xlrd branch from fad02a5 to 45e8193 Compare August 23, 2020 14:19

alimcmaster1 mentioned this pull request Aug 23, 2020

CI: specified bucket does not exist in TestParquetPyArrow.test_s3_roundtrip_explicit_fs #35856

Closed

roberthdevries requested review from WillAyd and MarcoGorelli August 23, 2020 16:23

simonjayhawkins reviewed Aug 24, 2020

cruzzoe and others added 3 commits August 26, 2020 11:01

Deprecate using xlrd engine and change default engine to read excel files to openpyxl

3a76a36

Revert all changes related to switching to openpyxl as the default

101aa97

Reword whatsnew message for the benefit of end users.

081ecf8

roberthdevries force-pushed the fix-28547-deprecate-xlrd branch from f740ed7 to 081ecf8 Compare August 26, 2020 09:02

simonjayhawkins reviewed Aug 26, 2020

WillAyd requested changes Aug 26, 2020

simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Nov 30, 2020

rhshadrach added 2 commits November 30, 2020 16:22

Re-added tests, minor doc touchups

f9876dd

Test for no warning as well

bc3ec47

jreback reviewed Dec 1, 2020

jorisvandenbossche reviewed Dec 1, 2020

Doc tweaks

fe10a89

jreback merged commit b3a3932 into pandas-dev:master Dec 1, 2020

twoertwein mentioned this pull request Dec 1, 2020

getiterator deprecated in Python 3.9; failure to call pd.read_excel() #37795

Closed

rhshadrach mentioned this pull request Dec 4, 2020

Deprecate xlwt #26552

Closed

This was referenced Dec 11, 2020

Deprecate using xlrd engine in favor of openpyxl #28547

Closed

shift default excel read engine from xlrd to openpyxl #38424

Closed

rhshadrach mentioned this pull request Dec 22, 2020

BUG: Roundtrip with openpyxl and datetime precision #38644

Closed

kcharlie2 mentioned this pull request Jan 4, 2021

BUG: read_excel() fails when checking __version__ of older xlrd versions #38955

Closed

JuliaWilkinsSonos mentioned this pull request Jan 4, 2021

BUG: read_excel() using openpyxl engine header argument not working as expected #38956

Closed

st-pasha mentioned this pull request Jan 6, 2021

xlrd no longer supports xlsx, unhelpful error h2oai/datatable#2823

Closed

carrascomj mentioned this pull request Nov 4, 2021

Upgrade to xlrd 2.0.0 + openpyxl h2oai/datatable#3191

Open

roberthdevries deleted the fix-28547-deprecate-xlrd branch March 14, 2022 13:15

WillAyd mentioned this pull request Oct 19, 2022

DEP: Enforce deprecation of mangle_dup cols and convert_float in read_excel #49089

Merged
