Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Adjust read excel behavior for xlrd >= 2.0 #38571

Merged
merged 19 commits into from
Dec 23, 2020

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Dec 18, 2020

Alternative to #38522. I've been testing this locally using both xlrd 1.2.0 and 2.0.1.

One test fails because we used to default to xlrd but now default to openpyxl, it's not clear to me if this test should be passing with openpyxl.

cc @cjw296, @jreback, @jorisvandenbossche

@rhshadrach
Copy link
Member Author

Windows failures are related, will be investigating.

else:
peek = buf
except FileNotFoundError:
# File may be a url, return the extension
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why provide degraded inference just because the source is a URL?
It appears this code goes out of its way to avoid using seek, whereas the ODS inference code that was there before, and xlrd which is being hit in many of the existing code paths already does seek without issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous ODS code read at most 84 bytes; current code needs the entire file. I'll do a partial revert here and utilize the previous ODS code, but I have reservations about downloading the entire file here. Would like to hear others' thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah; I think I have falsely assumed we need to get the entire contents of the file. I think it should be possible to get BufferedIOBase/RawIOBase into the proper form for ZipFile without reading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did put quite a lot of effort into #38424; the code in that PR is the way it is from, as carefully as I could, following through the various code paths and ensuring the behaviour was as simple and robust as it could be, in spite of less automated testing than I was expecting. It's tough to see these kind of comments which sort of imply that I hadn't thought any of this through...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In no way shape or form did I have any intention of implying such a thing. As my comment above says, I was mistaken.

@cjw296
Copy link
Contributor

cjw296 commented Dec 19, 2020

I believe this would also replace #38456.

I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2.

This PR has code that is more complex than #38522, even ignoring that which is in place to support the convoluted deprecation process it advocates for. I haven't seen code coverage metrics as part of the pandas PR process, but are you sure all the new conditional branches you've introduced are covered by sufficient tests?

In case it wasn't clear in #38522, I believe the most robust approach in this area is to:

  • obtain the stream we're going to end up with anyway
  • peek the minimum number of bytes from it to do content-based inference
  • .seek(0) to get the stream back to its initial state.

@rhshadrach
Copy link
Member Author

Thanks for the comments @cjw296, I was able to simplify/improve a lot from them.

@cjw296
Copy link
Contributor

cjw296 commented Dec 19, 2020

@rhshadrach - I'm confused as to why you didn't just start with #38522 and add the fallback logic you're advocating for. (corrected)

@rhshadrach
Copy link
Member Author

@cjw296

I'm confused as to why you didn't just start with #38424 and add the fallback logic you're advocating for.

I think you mean your PRs #38456/#38522. I had started this work 7 days ago, prior to the existence of them.

@rhshadrach
Copy link
Member Author

2 linux tests failed to start, all other checks passed. Rerunning.

@rhshadrach
Copy link
Member Author

/azp run

@azure-pipelines
Copy link
Contributor

Azure Pipelines successfully started running 1 pipeline(s).

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 19, 2020

Update: Failure if from one of the builds using zh_CN.utf8

Failure is related, though I don't understand it:

>               handle = open(handle, ioargs.mode)
E               FileNotFoundError: [Errno 2] 没有那个文件或目录: 'foo.xlsm'

[snip]

with pytest.raises(FileNotFoundError, match="No such file or directory"):
>           pd.read_excel(bad_file)
E           AssertionError: Regex pattern 'No such file or directory' does not match "[Errno 2] 没有那个文件或目录: 'foo.xlsm'".

@jreback jreback added the IO Excel read_excel, to_excel label Dec 21, 2020
@jreback jreback added this to the 1.2 milestone Dec 21, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas/io/excel/_base.py Outdated Show resolved Hide resolved
@jreback jreback mentioned this pull request Dec 21, 2020
@jorisvandenbossche
Copy link
Member

Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)

@jreback
Copy link
Contributor

jreback commented Dec 21, 2020

Regardless of which PR was started first, it's also an option to update #38522 to use the behaviour we want (fallback to xlrd if openpyxl is not installed, for xlrd < 2.0, see also #38522 (review), I think it should be fairly straightforward to add that to that PR)

ok sure. let's try to get this in today and cut the release. i view better inference as good but shouldn't hold this up.

@rhshadrach
Copy link
Member Author

@jreback @jorisvandenbossche Is there room for improving the inference done here? As far as I can tell, the inference done here is that of #38522 but also handles bytes and urls (which I believe is causing failures in #38522).

@rhshadrach
Copy link
Member Author

Two of the travis builds failed to start, the third one passed. Restarted manually.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor comment. pls commetn / update

pandas/tests/io/excel/test_writers.py Outdated Show resolved Hide resolved
pandas/io/excel/_base.py Outdated Show resolved Hide resolved
@rhshadrach
Copy link
Member Author

@rhshadrach

https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=50726&view=logs&j=404760ec-14d3-5d48-e580-13034792878f&t=f81e4cc8-d61a-5fb8-36be-36768e5c561a looks like a legit failure

@jreback - Agreed, it is, but I don't understand it. The warning is emitted when openpyxl is not installed but xlrd < 2.0 is and a non-xls file is used. Replicating this in my environment, the test passes with the warning emitted with the right stacklevel (checked manually). So maybe this is a Windows vs Linux issue?

What really doesn't make sense is that actual_warning.filename is pandas\io\excel\_base.py where the Warning is emitted, but the stacklevel is very clearly being set to either be 2 or 4. I don't see how it's possible that _base.py is then the filename.

Attempting to debug via CI now.

@jorisvandenbossche
Copy link
Member

If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

@rhshadrach
Copy link
Member Author

If it only fails on windows, I think it is also fine to add a check_stacklevel=False to the particular test ..

Will do, is this a known issue with Windows? Couldn't find an issue on github, will add one if it is.

@jorisvandenbossche
Copy link
Member

Actually, it's a test asserting warnings about positional arguments, not about the engine. So that might interfere with the test, because it now also raises another warning than the one it is testing?

@rhshadrach
Copy link
Member Author

@jorisvandenbossche Yes, that appears to me to be the reason why it's failing now. However, it passes locally for me and stepping through the assert code, I don't see any reason why it might not work on Windows. But most likely I'm missing some peculiarity about windows..

@jorisvandenbossche
Copy link
Member

I think it is certainly fine to skip the check for stacklevel here (or mayble only skip it on windows, so it's still tested generally)

@jorisvandenbossche
Copy link
Member

@rhshadrach sorry, I merged #38456, which will give some merge conflict pains. I can also deal with it if you like

…rd_warnings

� Conflicts:
�	doc/source/whatsnew/v1.2.0.rst
�	pandas/io/excel/_base.py
@rhshadrach
Copy link
Member Author

@jorisvandenbossche - Not a problem. The merge was fairly straightforward, but your review would be appreciated to make sure I didn't mess any of it up. (I am also double checking it now)

The stacklevel issue was a bug in one of my previous PRs where the file path was hardcoded using '/'. Should be fixed now. Also your requested test has been added.

@jorisvandenbossche
Copy link
Member

Doc changes look good after the merge!


def test_corrupt_bytes_raises(self, read_ext, engine):
bad_stream = b"foo"
with pytest.raises(BadZipFile, match="File is not a zip file"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also be left for a follow-up (time to get this merged ;-)), but this is not a super clear error message in case you accidentally pass a non-excel file to read_excel

We should probably check in the inspect function if it is a zip file (like is also done here: https://github.com/pandas-dev/pandas/pull/38522/files#diff-63200ddb7f5656b8ee868a28d9cb7720ffe50689b0e3fb0b4e15cc5c0ae80dd7R942), and if not raise an error like "file is not recognized as an excel file" or so.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

@rhshadrach unfortunately a couple of legit looking failures. note happy to xfail those tests for those versions is fine.

@rhshadrach
Copy link
Member Author

@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

@jreback - the test suggested by @jorisvandenbossche detected that we were not defaulting to pyxlsb for xlsb files when engine is None; opened #38667 and xfailed the test.

excellent, ping on green-ish

@jreback jreback merged commit 263e1ee into pandas-dev:master Dec 23, 2020
@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

thanks @rhshadrach amazing job here.

and thank you @cjw296 for your PR and the input.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2020

@meeseeksdev backport 1.2.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 23, 2020
@rhshadrach rhshadrach deleted the xlrd_warnings branch December 23, 2020 23:02
jreback pushed a commit that referenced this pull request Dec 24, 2020
…38670)

Co-authored-by: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member

Thanks @rhshadrach !

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shift default excel read engine from xlrd to openpyxl
4 participants