Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shift default excel read engine from xlrd to openpyxl #38424

Closed
leo4183 opened this issue Dec 12, 2020 · 24 comments · Fixed by #38571
Closed

shift default excel read engine from xlrd to openpyxl #38424

leo4183 opened this issue Dec 12, 2020 · 24 comments · Fixed by #38571
Assignees
Labels
Blocker Blocking issue or pull request for an upcoming release Deprecate Functionality to remove in pandas Docs IO Excel read_excel, to_excel
Milestone

Comments

@leo4183
Copy link

leo4183 commented Dec 12, 2020

since v2.0.0, xlrd no longer supports excel files other than ".xls". manually specify pd.read_excel engine (to openpyxl and etc) is a bit annoying (otherwise pandas would complain about the missing xlrd module). is it possible to shift the default engine to sth more commonly used (eg. openpyxl )

@leo4183 leo4183 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 12, 2020
@rhshadrach
Copy link
Member

Thanks for raising this, I was not aware of the recent changes to xlrd. In pandas 1.2, if you have openpyxl installed it will be used as the default for reading non-xls files. See https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.2.0.html for more info.

With the change to xlrd, I think the messages in the whatsnew and potentially warning messages need to be updated. cc @jorisvandenbossche @jreback

@rhshadrach rhshadrach added Deprecate Functionality to remove in pandas Docs and removed Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 12, 2020
@rhshadrach rhshadrach added this to the 1.2 milestone Dec 12, 2020
@rhshadrach rhshadrach added the IO Excel read_excel, to_excel label Dec 12, 2020
@jorisvandenbossche
Copy link
Member

@rhshadrach see #28547 (comment) and my answer below. I think we indeed should check for this version in pandas, and update the error message (and whatsnew message) to make it clearer.

@cjw296
Copy link
Contributor

cjw296 commented Dec 12, 2020

I'm sorry for causing you folk extra work :-(

@rhshadrach rhshadrach self-assigned this Dec 12, 2020
@simonjayhawkins simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Dec 13, 2020
@simonjayhawkins
Copy link
Member

@rhshadrach see #28547 (comment) and my answer below. I think we indeed should check for this version in pandas, and update the error message (and whatsnew message) to make it clearer.

added the blocker label.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 13, 2020

@jorisvandenbossche My understanding is that, other than ods-streams, only xlrd can read buffer-like values (i.e. non-path) of path_or_buffer in read_excel. As such, one thought is (roughly) to define:

is_xlrd_only = "(non-ods buffer) or (xls-type extension) or (xlrd.Book instance)"

Then:

  • Emit a FutureWarning if:

    • engine was passed as None;
    • not is_xlrd_only;
    • openpyxl is not installed (as would use openpyxl); and
    • and xlrd_version < 2
  • Emit a ValueError anytime (even if engine was not passed as None):

    • engine was determined to be "xlrd";
    • not is_xlrd_only; and
    • xlrd_version >= 2

One case I want to highlight is when read_excel is passed a buffer to a non-xls/ods format. With the logic above, we'd be passing this onto xlrd without warning. I don't know if there is a good way to tell if the buffer can handled by xlrd.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 13, 2020

only xlrd can read buffer-like values (i.e. non-path)

But only if it is a buffer of a file type it does support? Eg with xlrd >= 2.0, it won't read a buffer-like of an xlsx file?

Emit a ValueError anytime (even if engine was not passed as None): ....

Yes, that sounds correct, because for those conditions, xlrd (>= 2.0) would error anyway (I think), and so we want to raise the error, so we can give it a pandas-specific error message.

One case I want to highlight is when read_excel is passed a buffer to a non-xls/ods format. With the logic above, we'd be passing this onto xlrd without warning. I don't know if there is a good way to tell if the buffer can handled by xlrd.

Yeah, see @cjw296's comment here: #28547 (comment). He mentions an inspect_format which might help here. That could maybe be used if xlrd is installed, and if it's version >= 2.0, so we can check if the buffer is readable by xlrd or not.

@rhshadrach
Copy link
Member

Thanks @jorisvandenbossche, I had missed the inspect_format part. I think that solves the part where I had reservations.

@cjw296
Copy link
Contributor

cjw296 commented Dec 13, 2020

@rhshadrach - I'm the maintainer of xlrd, so I wanted to be clear on the intention here: xlrd should only be used for .xls files, please can you emit a warning whenever it is used for anything else, no matter what version of xlrd is installed?

It absolutely should not be used for opening .xlsx files, or any other form of zip/xml files: there are security vulnerabilities lurking, it will break on Python 3.9 and it will fail at parsing some files which openpyxl will do a much better job of.

Please, I'm begging you, do not put in hacks which try and suggest users can get away with "sticking with xlrd 1.2" or whatever...

@jorisvandenbossche
Copy link
Member

please can you emit a warning whenever it is used for anything else?

That's already the case on master / on 1.2rc, see #35029

The whatsnew notice about this (the first warning box in the 1.2 release notes) can be seen here: https://pandas.pydata.org/docs/dev/whatsnew/v1.2.0.html
Specific feedback on the messaging is certainly welcome.

@jorisvandenbossche
Copy link
Member

Actually, re-reading that, I suppose we do not raise a warning when you manually specify engine="xlrd" but are reading a non-xls file. This is not the default (the default of engine=None will either use openpyxl if installed, or otherwise use xlrd with warning), and in the past we always used xlrd automatically in this case, so I don't expect many users to manually specify this, but that's still a case where we should warn as well, then.

@cjw296
Copy link
Contributor

cjw296 commented Dec 13, 2020

I've thrown in #38456 for my preferred wording related to this, and also how I think pandas should behave, but to summarise here:

  • if .xls, then use xlrd.
  • if engine='xlrd' explicitly specified, just let it through. A nice to have here would be to warn if it's being used for a non-'.xls' file.
  • in other cases, use a more appropriate library or throw an exception that the file cannot be safely parsed.

FWIW, you may well want to copy the inspection code and the one constant it needs into pandas, for which you have my explicit permission. That would allow you to dispatch bytes streams using the same logic I describe above.

@jorisvandenbossche
Copy link
Member

how I think pandas should behave, but to summarise here:

As far as I understand it, that's exactly how pandas now behaves (except for the potential warning if engine="xlrd" is specified explicitly but used for non-xls files)

@cjw296
Copy link
Contributor

cjw296 commented Dec 13, 2020

As far as I understand it, that's exactly how pandas now behaves (except for the potential warning if engine="xlrd" is specified explicitly but used for non-xls files)

I'm afraid not:

  • this means that if the file has no extension, or bytes rather than a path are passed, xlrd will be preferred.

  • this defaults to 'xlrd', whereas I would prefer it raise an "unsupported exception".

That second one might be made more palatable by using the code I linked to in order to pick a better engine, in the case where you're trying to read a byte stream or from a file that has an incorrect or not-present file extension.

@jorisvandenbossche
Copy link
Member

this means that if the file has no extension, or bytes rather than a path are passed, xlrd will be preferred.

Yes, that's the one case that @rhshadrach raised above as well (#38424 (comment)), and for which we gladly use the code in xlrd to better detect whether it is actually a xls file or not.

this defaults to 'xlrd', whereas I would prefer it raise an "unsupported exception".

As I also just commented in #38456 about this: in pandas we decided to do this change with a deprecation warning. So we will raise an exception in the future when xlrd gets being used for non-xls files; but for now only a warning.

I should have been clearer, and say "that's exactly how pandas now intents to behave" (as far as I understand it), so taking into account that we first do a warning before raising an exception, and except for the corner cases being discussed in this issue (for buffer objects we don't yet check the exact file type + also when specifying engine="xlrd" explicitly, we should probably warn if used with non-xls files)

@cjw296
Copy link
Contributor

cjw296 commented Dec 13, 2020

Just to echo what I said in the PR comments:

  • In an ideal world, I would prevent anyone using xlrd < 2 from now onwards. It's not a good idea, even for .xls, and it's a positively dangerous idea for .xlsx.

  • Moving to pandas 1.2 would require a user to upgrade pandas, at the very least. If a user is already upgrading one package, it would be a perfect time to upgrade to xlrd 2 and/or openpyxl.

If I worked up a PR to do the guessing xlrd.inspect_format does, but in the pandas codebase, in return for warning whenever anyone uses xlrd < 2 and raising an exception, rather than using xlrd as the default, would it be merged for the pandas 1.2 release?

@jorisvandenbossche
Copy link
Member

@jreback said (#38456 (review)):

we still have the question of what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

do we have testing with > 2 xlrd?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Dec 15, 2020

what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl?

Indeed, as mentioned above, I think we should check for this case, so we can raise a custom, pandas specific error message that is more informative for the pandas case (pandas supports other formats like xlsx, the error message from xlrd could give the impression that we don't)


But so the main question we need to answer is: do we want to allow users to still use xlrd (< 2.0) to read non-xls files, with warning that this is deprecated? (more or less what is now implemented in master) Or do we directly want to disallow this and raise an exception?

@cjw296 (xlrd maintainer) argues for directly disallowing, given the potential security issues.
And the reason we currently only deprecate it, is to give users a smoother upgrade experience: updating pandas will start raising a warning they should no longer use xlrd (only if openpyxl is not installed, otherwise that will already be used automatically), but not break their code.

I am personally leaning towards keeping the warning (but we could make the message stronger about the risks of further using it), but happy to go with the majority.

cc @WillAyd @roberthdevries @bashtage

@cjw296
Copy link
Contributor

cjw296 commented Dec 16, 2020

I've opened #38522, so let's follow up there.

@jorisvandenbossche
Copy link
Member

Let's keep the actual discussion on what general behaviour we want here (it's already scattered enough), specific code review can of course happen in #38522

@cjw296
Copy link
Contributor

cjw296 commented Dec 16, 2020

From #38522: I've gone ahead and implemented the content-based engine inference I suggested, and I believe the outcome is now better all round:

  • .xlsb files will now pick the correct engine rather than falling through and returning an obscure error from xlrd
  • engines will now be correctly picked, even if the file is mis-named. (this happens more than many might realise: I wonder how many web sites are still serving up html tables at urls with .xls extensions because Excel will open this)
  • should a file be indecipherable, we'll still try based on file extension (there are some truly ancient excel files where this applies, but I couldn't find a .xls to add as a test...)
  • you'll now always get simple, clear optional dependency exceptions telling you which library you need to install.
  • you can still explicitly specify xlrd as the engine if you want to use if really want to stick on xlrd 1.2 and experience the bugs and security risks associated with it.
  • the code is, well, in my opinion at least, a lot simpler and clearer.

@jorisvandenbossche
Copy link
Member

@cjw296 Thanks a lot for that! That's a nice clean-up / improvement in any case, regardless of whether we decide to fallback from openpyxl to xlrd or not (in case openpyxl is not installed)

I am hoping that some more people chime in here with their thoughts on this. As long as that doesn't happen, I don't want to decide changing the behaviour we added in the #35029 (if openpyxl is not installed, still continue to use xlrd to backcompat, but with warning about it being deprecated and the security issues).

@jreback
Copy link
Contributor

jreback commented Dec 16, 2020

about this comment

But so the main question we need to answer is: do we want to allow users to still use xlrd (< 2.0) to read non-xls files, with warning that this is deprecated? (more or less what is now implemented in master) Or do we directly want to disallow this and raise an exception?

I would say absolutely yes. just because you updated pandas doesn't mean we need to break because you don't update xlrd. these are two separate things and we don't hard break like this. While I appreciate @cjw296 desire to not support xlrd anymore which is totally fine. There are a great many intallations out there and its not great to just 'break them'. We are already deprecating this, so I don't think much to do (beyond handling the situation if xlrd >2 is installed, which its clear, we must break and point to openpyxl).

@cjw296
Copy link
Contributor

cjw296 commented Dec 19, 2020

Just to echo my comment on #38571: I find the desire to support xlrd 1.2 at a time when pandas users are already upgrading a package to be both surprising and disappointing, given the potential security issues and poor parsing experience associated with sticking with xlrd 1.2.

This isn't a "hard break", no code changes are required for users, simply installing one more package at a time when you're already upgrading other packages.

It's probably worth noting too that bending over backwards to support a broken and insecure version of a package is going to leave pandas with more convoluted code in this area going forwards..

@gfyoung
Copy link
Member

gfyoung commented Dec 22, 2020

This is a tough call, but strictly from a package maintenance perspective, I'm inclined to take the more conservative approach that @jreback is advocating. Although it is extremely belated, an API change like this still risks downstream breakage, which is likely to cause more chaos and headache.

It's probably worth noting too that bending over backwards to support a broken and insecure version of a package is going to leave pandas with more convoluted code in this area going forwards.

Perhaps, but moments like those also present themselves as opportunities to get people to move over to openpyxl. Rather than acquiesce and try to fix with xlrd, challenge them first to use the preferred engine and then evaluate afterwards.

#38522 (comment)

Largely agree with what @jorisvandenbossche is saying, but I would also add that the "lukewarm" response @cjw296 you are receiving isn't because we are not interested in making the switch. We are trying to execute it carefully to minimize issues. The fact we were unaware of it for so long is unfortunate, but from a strategic POV, it also is a sunk cost at this point.

Deprecations in pandas are tracked and are eventually rectified in future releases. This deprecation cycle is not that different from others, and it will get done once we merge the PR's into master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Deprecate Functionality to remove in pandas Docs IO Excel read_excel, to_excel
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants