Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to xlrd 2.0.0 + openpyxl #3191

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Conversation

carrascomj
Copy link

Related to #2632

Description

xlrd dropped support for anything but xls at 2.0.0 python-excel/xlrd#371. Using the old version may cause security vulnerabilities and potential incorrect parsing. It is also a problem for people that have installed pandas in their environment (with, more likely, openpyxl and xlrd>2.0).

Implementation

  • Use openpyxl for xlsx, the recommended alternative.
  • openpyxl was added as extra requirements and xlrd was upgraded to >=2.0.0.

I would like to know if this is desirable before writing any unitests.

Thanks!

@carrascomj
Copy link
Author

I implemented some tests in the same fashion as the ones for jay. For them to run on CI, openpyxl has to be installed during the appveyor build (here and in the subsequent pip-installs, I guess).

@samukweku samukweku changed the title Upgrade to xlrd 2.0.0 + openpyxl [ENH] Upgrade to xlrd 2.0.0 + openpyxl Nov 1, 2021
@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Nov 1, 2021

@carrascomj thanks for your contribution! Nice to know we can't use xlrd for .xlsx anymore. Let me review it and also see why some tests are failing on jenkins.

@samukweku If you only adjust the title to [ENH] ..., it will still be complicated to find PR's like that later. More effective solution is to assign a label improve. Then all the improvements could be seen as https://github.com/h2oai/datatable/issues?q=%22improve%22+label%3Aimprove

@oleksiyskononenko oleksiyskononenko added the improve Improvement of an existing functionality label Nov 1, 2021
@samukweku samukweku changed the title [ENH] Upgrade to xlrd 2.0.0 + openpyxl Upgrade to xlrd 2.0.0 + openpyxl Nov 1, 2021
@oleksiyskononenko
Copy link
Contributor

@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

@carrascomj
Copy link
Author

@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

Ups, I had not seen that issue nor the PR associated with it, sorry. In terms of the mentioned 10x performance decrease, it was a problem with an edge-case: the xlsx contained a link to another file (see pandas-dev/pandas#35029 (comment)).

With that said, the code in this PR is far from optimal. I will run some benchmarks and see how far openpyxl can go. However, I agree that a custom parser goes more in line with the perfomance (like it was suggested before) expected from datatable.

@carrascomj
Copy link
Author

I tested the performance with master (4 columns, 200000 rows: int, float, datetime, str) and there is a 2x performance decrease, which is consistent with reported decreases in pandas.

The changes on 977af71 do not change performance but simplify the code.

---------------------------------------------------- benchmark: 2 tests ------------------------------------
Name (time in s)        Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------
openpyxl_this       19.3181  20.0605  19.6741  0.3141  19.7758  0.5201       2;0  0.0508       5           1
xlrd_upstream       10.6452  10.8720  10.7272  0.0898  10.7248  0.1125       1;0  0.0932       5           1
------------------------------------------------------------------------------------------------------------

if subpath in wb.sheetnames:
sheetname = subpath
else:
if "/" in subpath:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appveyor is failing in Windows because of the conversion to a path with backslashes.

  1. C:\Path\to\file.xlsx\Sheet\A2:B9 fails (also for xls files with the current xlrd-based parser).
  2. C:\Path\to\file.xlsx\Sheet/A2:B9 also fails, since there is not such a C:\Path\to\file.xlsx\Sheet sheet, aftersplitting.
  3. C:\Path\to\file.xlsx/Sheet/A2:B9 works.

The intended behavior in Windows is to work with (1), right? I guess both xlsx.py and xls.py should check for backslashes if no (sheetname, range2d) pair was found when a subpath is specified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows support was added after the excel reading feature, so, you're right, we probably missed the backslash issue. My feeling is that on Windows we should support both types of slashes. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I guess it would be more consistent for the user if they do not have to change the code for different platforms. I'll give it a try in a day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue has been fixed as of #3220

I'll give it a try in a day.

Do you still want to run some benchmarks for this PR?

@oleksiyskononenko
Copy link
Contributor

Thanks for testing @carrascomj. "2x performance decrease" seems quite significant...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improve Improvement of an existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants