DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

rhshadrach · 2020-12-18T23:42:20Z

closes shift default excel read engine from xlrd to openpyxl #38424
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Alternative to #38522. I've been testing this locally using both xlrd 1.2.0 and 2.0.1.

One test fails because we used to default to xlrd but now default to openpyxl; it's not clear to me whether this test should pass with openpyxl.

cc @cjw296, @jreback, @jorisvandenbossche

rhshadrach · 2020-12-19T01:25:38Z

Windows failures are related, will be investigating.

cjw296 · 2020-12-19T07:22:48Z

pandas/io/excel/_base.py

+                else:
+                    peek = buf
+        except FileNotFoundError:
+            # File may be a url, return the extension


Why provide degraded inference just because the source is a URL?
It appears this code goes out of its way to avoid using seek, whereas the ODS inference code that was there before, and xlrd, which is hit in many of the existing code paths, already seek without issue.

Previous ODS code read at most 84 bytes; current code needs the entire file. I'll do a partial revert here and utilize the previous ODS code, but I have reservations about downloading the entire file here. Would like to hear others' thoughts.

Ah; I think I have falsely assumed we need to get the entire contents of the file. I think it should be possible to get BufferedIOBase/RawIOBase into the proper form for ZipFile without reading.
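A minimal sketch of that idea, not the code from this PR: `zipfile.ZipFile` accepts any seekable binary file object and locates the central directory itself, so the caller does not have to load the whole stream into memory first. The helper name `guess_zip_flavour` and the member-name checks (`xl/workbook.xml` for xlsx, `xl/workbook.bin` for xlsb) are assumptions made for illustration.

```python
import io
import zipfile


def guess_zip_flavour(stream):
    """Hypothetical helper: infer a format from ZIP member names."""
    # ZipFile seeks within the stream on its own; no up-front stream.read() is needed.
    with zipfile.ZipFile(stream) as archive:
        names = set(archive.namelist())
    stream.seek(0)  # rewind so a later reader starts from the beginning
    if "xl/workbook.xml" in names:
        return "xlsx"
    if "xl/workbook.bin" in names:
        return "xlsb"
    return "zip"


# Build a tiny in-memory archive that only mimics an xlsx layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
buf.seek(0)
flavour = guess_zip_flavour(buf)
```

Because a file object was passed in, closing the `ZipFile` leaves the underlying stream open, which is why the rewind afterwards works.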

I did put quite a lot of effort into #38424; the code in that PR is the way it is because I followed, as carefully as I could, the various code paths and ensured the behaviour was as simple and robust as it could be, in spite of less automated testing than I was expecting. It's tough to see these kinds of comments, which sort of imply that I hadn't thought any of this through...

In no way shape or form did I have any intention of implying such a thing. As my comment above says, I was mistaken.

cjw296 · 2020-12-19T07:34:25Z

I believe this would also replace #38456.

I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2.

This PR has code that is more complex than #38522, even ignoring that which is in place to support the convoluted deprecation process it advocates for. I haven't seen code coverage metrics as part of the pandas PR process, but are you sure all the new conditional branches you've introduced are covered by sufficient tests?

In case it wasn't clear in #38522, I believe the most robust approach in this area is to:

obtain the stream we're going to end up with anyway
peek the minimum number of bytes from it to do content-based inference
.seek(0) to get the stream back to its initial state.
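The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code of either PR: the helper name `sniff_container` is hypothetical, and the two signatures used are the well-known OLE2 compound-file header (used by xls) and the ZIP local-file header (used by xlsx, xlsb and ods containers).

```python
import io

XLS_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file header
ZIP_SIGNATURE = b"PK\x03\x04"  # local file header of a ZIP archive
PEEK_SIZE = max(len(XLS_SIGNATURE), len(ZIP_SIGNATURE))


def sniff_container(stream):
    """Hypothetical helper: classify a seekable binary stream by its first bytes."""
    head = stream.read(PEEK_SIZE)  # peek only as many bytes as the signatures need
    stream.seek(0)  # put the stream back in its initial state
    if head.startswith(XLS_SIGNATURE):
        return "xls"
    if head.startswith(ZIP_SIGNATURE):
        return "zip"
    return None


stream = io.BytesIO(ZIP_SIGNATURE + b"rest of the archive")
kind = sniff_container(stream)
```

The same stream object can then be handed to whichever engine is selected, since its position is back at zero.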

rhshadrach · 2020-12-19T13:43:39Z

Thanks for the comments @cjw296, I was able to simplify/improve a lot from them.

cjw296 · 2020-12-19T14:06:01Z

@rhshadrach - I'm confused as to why you didn't just start with #38522 and add the fallback logic you're advocating for. (corrected)

rhshadrach · 2020-12-19T14:13:26Z

@cjw296

> I'm confused as to why you didn't just start with #38424 and add the fallback logic you're advocating for.

I think you mean your PRs #38456/#38522. I had started this work 7 days ago, prior to the existence of them.

rhshadrach · 2020-12-19T16:09:16Z

2 linux tests failed to start, all other checks passed. Rerunning.

rhshadrach · 2020-12-19T16:09:23Z

/azp run

azure-pipelines · 2020-12-19T16:09:32Z

Azure Pipelines successfully started running 1 pipeline(s).

rhshadrach · 2020-12-19T17:05:39Z

Update: Failure is from one of the builds using zh_CN.utf8

Failure is related, though I don't understand it:

>               handle = open(handle, ioargs.mode)
E               FileNotFoundError: [Errno 2] 没有那个文件或目录: 'foo.xlsm'

[snip]

with pytest.raises(FileNotFoundError, match="No such file or directory"):
>           pd.read_excel(bad_file)
E           AssertionError: Regex pattern 'No such file or directory' does not match "[Errno 2] 没有那个文件或目录: 'foo.xlsm'".
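The mismatch arises because the OS error text is translated under the `zh_CN.utf8` locale, while the `[Errno 2]` prefix and the exception's `errno` attribute stay the same. One locale-independent way to assert on this, sketched here as an assumption and not as the fix that was actually committed, is to check `errno` (or match `\[Errno 2\]`) instead of the English message. The helper name `open_missing` is hypothetical.

```python
import errno
import os
import re
import tempfile


def open_missing(path):
    """Hypothetical helper: return the FileNotFoundError raised for a missing path."""
    try:
        with open(path, "rb"):
            pass
    except FileNotFoundError as err:
        return err
    raise AssertionError(f"{path!r} unexpectedly exists")


with tempfile.TemporaryDirectory() as tmp:
    err = open_missing(os.path.join(tmp, "foo.xlsm"))

# errno is numeric, so it does not depend on the active locale.
assert err.errno == errno.ENOENT
# The "[Errno 2]" prefix of the message is not translated either.
assert re.search(r"\[Errno 2\]", str(err))
```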

jreback

lgtm. @jorisvandenbossche @simonjayhawkins

pandas/io/excel/_base.py

jorisvandenbossche · 2020-12-21T15:10:18Z

Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)
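The fallback described here can be sketched as a small decision function. This is an illustration under assumptions, not the code in either PR: the name `choose_engine`, its arguments, the tuple-based version comparison and the final branch are all hypothetical. It encodes only the rules stated in this thread (xlrd for xls, openpyxl preferred otherwise, xlrd < 2.0 as the fallback when openpyxl is missing, with a warning in that case).

```python
def choose_engine(ext, openpyxl_installed, xlrd_version):
    """Hypothetical sketch: return (engine, warn) for a detected file extension.

    xlrd_version is a tuple such as (1, 2, 0), or None when xlrd is not installed.
    """
    if ext == "xls":
        return "xlrd", False
    if openpyxl_installed:
        return "openpyxl", False
    if xlrd_version is not None and xlrd_version < (2, 0):
        # Old xlrd can still read non-xls files: fall back to it and flag a warning.
        return "xlrd", True
    # Assumption: with nothing usable installed, name the preferred engine
    # and let the later import fail with its own error.
    return "openpyxl", False


decision = choose_engine("xlsx", openpyxl_installed=False, xlrd_version=(1, 2, 0))
```

Other formats such as ods and xlsb are left out of this sketch on purpose.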

jreback · 2020-12-21T15:30:12Z

> Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)

ok sure. let's try to get this in today and cut the release. i view better inference as good but shouldn't hold this up.

rhshadrach · 2020-12-21T16:12:15Z

@jreback @jorisvandenbossche Is there room for improving the inference done here? As far as I can tell, the inference done here is that of #38522 but also handles bytes and urls (which I believe is causing failures in #38522).

rhshadrach · 2020-12-21T16:13:05Z

Two of the travis builds failed to start, the third one passed. Restarted manually.

jreback

very minor comment. pls comment / update

pandas/tests/io/excel/test_writers.py

pandas/io/excel/_base.py

rhshadrach · 2020-12-23T14:46:13Z

> @rhshadrach
>
> https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=50726&view=logs&j=404760ec-14d3-5d48-e580-13034792878f&t=f81e4cc8-d61a-5fb8-36be-36768e5c561a looks like a legit failure

@jreback - Agreed, it is, but I don't understand it. The warning is emitted when openpyxl is not installed but xlrd < 2.0 is and a non-xls file is used. Replicating this in my environment, the test passes with the warning emitted with the right stacklevel (checked manually). So maybe this is a Windows vs Linux issue?

What really doesn't make sense is that actual_warning.filename is pandas\io\excel\_base.py where the Warning is emitted, but the stacklevel is very clearly being set to either be 2 or 4. I don't see how it's possible that _base.py is then the filename.
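For readers unfamiliar with `stacklevel`, the following is a generic sketch of how it changes the location a warning is attributed to. It is not the pandas code under discussion; the function names `emit`, `caller` and `reported_line` are hypothetical stand-ins.

```python
import warnings


def emit(stacklevel):
    # Stand-in for library code that issues the warning.
    warnings.warn("engine default will change", FutureWarning, stacklevel=stacklevel)


def caller(stacklevel):
    emit(stacklevel)  # with stacklevel=2 the warning is attributed to this line


def reported_line(stacklevel):
    """Return the line number recorded on the captured warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        caller(stacklevel)
    return caught[0].lineno


line_in_emit = reported_line(1)    # points inside emit()
line_in_caller = reported_line(2)  # points at the call inside caller()
```

With `stacklevel=1` the recorded location is the `warnings.warn` call itself, which is the situation described above where the reported file is the module that emits the warning.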

Attempting to debug via CI now.

jorisvandenbossche · 2020-12-23T14:52:01Z

If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

rhshadrach · 2020-12-23T14:54:18Z

> If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

Will do, is this a known issue with Windows? Couldn't find an issue on github, will add one if it is.

jorisvandenbossche · 2020-12-23T14:56:37Z

Actually, it's a test asserting warnings about positional arguments, not about the engine. So that might interfere with the test, because it now also raises another warning than the one it is testing?

rhshadrach · 2020-12-23T15:04:34Z

@jorisvandenbossche Yes, that appears to me to be the reason why it's failing now. However, it passes locally for me and stepping through the assert code, I don't see any reason why it might not work on Windows. But most likely I'm missing some peculiarity about windows..

jorisvandenbossche · 2020-12-23T15:07:15Z

I think it is certainly fine to skip the check for stacklevel here (or maybe only skip it on windows, so it's still tested generally)

jorisvandenbossche · 2020-12-23T16:49:24Z

@rhshadrach sorry, I merged #38456, which will give some merge conflict pains. I can also deal with it if you like

rhshadrach · 2020-12-23T16:57:55Z

@jorisvandenbossche - Not a problem. The merge was fairly straightforward, but your review would be appreciated to make sure I didn't mess any of it up. (I am also double checking it now)

The stacklevel issue was a bug in one of my previous PRs where the file path was hardcoded using '/'. Should be fixed now. Also your requested test has been added.
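A small sketch of how a path hardcoded with `/` can break a filename comparison on Windows. The thread does not show the offending code, so the suffix check below is an assumption made for illustration, and `is_base_module` is a hypothetical name. `ntpath` is used so the Windows behaviour can be reproduced on any platform; real code would normally use `os.path.join`.

```python
import ntpath

MODULE_PARTS = ("pandas", "io", "excel", "_base.py")


def is_base_module(filename, pathmod):
    """Hypothetical check: does filename end with the module path for this platform?"""
    return filename.endswith(pathmod.join(*MODULE_PARTS))


windows_file = r"C:\env\Lib\site-packages\pandas\io\excel\_base.py"

# A suffix hardcoded with "/" never matches a Windows-style path.
hardcoded_match = windows_file.endswith("pandas/io/excel/_base.py")  # False
# Joining the parts with the platform's separator does match.
joined_match = is_base_module(windows_file, ntpath)  # True
```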

jorisvandenbossche · 2020-12-23T17:26:46Z

Doc changes look good after the merge!

jorisvandenbossche · 2020-12-23T17:30:43Z

pandas/tests/io/excel/test_readers.py

+
+    def test_corrupt_bytes_raises(self, read_ext, engine):
+        bad_stream = b"foo"
+        with pytest.raises(BadZipFile, match="File is not a zip file"):


Could also be left for a follow-up (time to get this merged ;-)), but this is not a super clear error message in case you accidentally pass a non-excel file to read_excel

We should probably check in the inspect function if it is a zip file (like is also done here: https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R942), and if not raise an error like "file is not recognized as an excel file" or so.
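A rough sketch of such a guard, assuming a seekable binary stream. It is not the code that was merged: the function name `check_excel_like` and the exact error wording are hypothetical, and only `zipfile.is_zipfile` and the OLE2 signature for xls are relied on.

```python
import io
import zipfile

XLS_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file header


def check_excel_like(stream):
    """Hypothetical guard: raise a readable error for input that is neither xls nor zip."""
    head = stream.read(len(XLS_SIGNATURE))
    stream.seek(0)
    if head == XLS_SIGNATURE:
        return "xls"
    is_zip = zipfile.is_zipfile(stream)  # accepts file-like objects as well as paths
    stream.seek(0)  # is_zipfile moves the position, so rewind again
    if is_zip:
        return "zip"
    raise ValueError("File is not a recognized excel file")


try:
    check_excel_like(io.BytesIO(b"foo"))
except ValueError as exc:
    message = str(exc)
```

Raising at inspection time means the user sees this message instead of a `BadZipFile` from deeper in the reader.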

jreback · 2020-12-23T20:32:37Z

@rhshadrach unfortunately there are a couple of legit-looking failures. Note: xfailing those tests for those versions is fine.

rhshadrach · 2020-12-23T20:37:42Z

@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

jreback · 2020-12-23T20:59:20Z

> @jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

excellent, ping on green-ish

jreback · 2020-12-23T23:01:46Z

thanks @rhshadrach amazing job here.

and thank you @cjw296 for your PR and the input.

jreback · 2020-12-23T23:01:58Z

@meeseeksdev backport 1.2.x

jorisvandenbossche · 2020-12-24T09:04:59Z

Thanks @rhshadrach !

DEPR: Adjust read excel behavior for xlrd >= 2.0

06d79b9

cjw296 reviewed Dec 19, 2020

cjw296 mentioned this pull request Dec 19, 2020

shift default excel read engine from xlrd to openpyxl #38424

Closed

rhshadrach added 2 commits December 19, 2020 06:03

Use stringified path, suppress warnings in io.test_common

e2c29ff

Use get_handle & seek

f56970b

inspection raises if not a xls/zip file; tests for missing/corrupt file

e992049

rhshadrach added 3 commits December 20, 2020 08:02

Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…

13cd483

…rd_warnings

Minor doc fixes

792b53a

Fix test to handle zh_CN.utf8

266b18b

jreback added the IO Excel read_excel, to_excel label Dec 21, 2020

jreback added this to the 1.2 milestone Dec 21, 2020

jreback approved these changes Dec 21, 2020

pandas/io/excel/_base.py (outdated)

jreback mentioned this pull request Dec 21, 2020

RLS: 1.2 #37784

Closed

rhshadrach added 2 commits December 21, 2020 09:59

Use LooseVersion

56cf956

Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…

e1878a0

…rd_warnings

jreback requested changes Dec 22, 2020

pandas/tests/io/excel/test_writers.py (outdated)

pandas/io/excel/_base.py (outdated)

DEPR: Adjust read excel behavior for xlrd >= 2.0

0146258

Added debug statements Windows

59e0cb9

Fixed stacklevel determination for Windows, added test

8f8b74e

Merge branch 'master' of https://github.com/pandas-dev/pandas into xl…

f286497

…rd_warnings (Conflicts: doc/source/whatsnew/v1.2.0.rst, pandas/io/excel/_base.py)

jorisvandenbossche reviewed Dec 23, 2020

Changed error message on non-zip file

8c95acd

rhshadrach mentioned this pull request Dec 23, 2020

BUG: read_excel does not use pyxlsb for xlsb files when engine is None #38667

Closed

rhshadrach added 2 commits December 23, 2020 15:34

Handle error with xlsb files

4619363

xfail instead

05a6f09

jreback merged commit 263e1ee into pandas-dev:master Dec 23, 2020

meeseeksmachine mentioned this pull request Dec 23, 2020

Backport PR #38571 on branch 1.2.x (DEPR: Adjust read excel behavior for xlrd >= 2.0) #38670

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020

Backport PR pandas-dev#38571: DEPR: Adjust read excel behavior for xl…

933e870

…rd >= 2.0

rhshadrach deleted the xlrd_warnings branch December 23, 2020 23:02

jreback pushed a commit that referenced this pull request Dec 24, 2020

Backport PR #38571: DEPR: Adjust read excel behavior for xlrd >= 2.0 (#…

1222a46

…38670) Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

DEPR: Adjust read excel behavior for xlrd >= 2.0 (pandas-dev#38571)

785ec4d

lithomas1 mentioned this pull request May 4, 2021

CI: pin xlrd <2 for now #38524

Closed
