Upgrade to xlrd 2.0.0 + openpyxl #3191

carrascomj · 2021-10-28T22:18:28Z

Related to #2632

Description

xlrd dropped support for anything but xls in 2.0.0 (python-excel/xlrd#371). Using the old version may expose security vulnerabilities and lead to incorrect parsing. It is also a problem for people who have pandas installed in their environment (most likely together with openpyxl and xlrd>2.0).

Implementation

Use openpyxl for xlsx, the recommended alternative.
openpyxl was added as an extra requirement and xlrd was upgraded to >=2.0.0.
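To illustrate the split described above, here is a minimal sketch of choosing the parser by file extension. The function names `pick_excel_engine` and `read_first_sheet` are hypothetical and are not the code in this PR; the sketch only assumes the public openpyxl and xlrd calls shown.

```python
from pathlib import Path


def pick_excel_engine(path):
    """Return the name of the library that should parse `path`."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        return "openpyxl"  # xlrd >= 2.0.0 no longer reads this format
    if suffix == ".xls":
        return "xlrd"
    raise ValueError(f"not an Excel file: {path!r}")


def read_first_sheet(path):
    """Return the rows of the first sheet as a list of tuples."""
    if pick_excel_engine(path) == "openpyxl":
        import openpyxl  # imported lazily because it is an optional extra

        wb = openpyxl.load_workbook(path, read_only=True, data_only=True)
        try:
            ws = wb[wb.sheetnames[0]]
            return [tuple(row) for row in ws.iter_rows(values_only=True)]
        finally:
            wb.close()

    import xlrd  # only reached for .xls files

    sheet = xlrd.open_workbook(path).sheet_by_index(0)
    return [tuple(sheet.row_values(i)) for i in range(sheet.nrows)]
```

Importing each library inside the branch that needs it keeps openpyxl optional for users who only read xls files.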

I would like to know if this is desirable before writing any unit tests.

Thanks!

carrascomj · 2021-10-30T13:32:31Z

I implemented some tests in the same fashion as the ones for jay. For them to run on CI, openpyxl has to be installed during the appveyor build (here and in the subsequent pip-installs, I guess).

oleksiyskononenko · 2021-11-01T19:54:43Z

@carrascomj thanks for your contribution! Nice to know we can't use xlrd for .xlsx anymore. Let me review it and also see why some tests are failing on jenkins.

@samukweku If you only adjust the title to [ENH] ..., it will still be complicated to find PRs like that later. A more effective solution is to assign the label improve. Then all the improvements can be seen at https://github.com/h2oai/datatable/issues?q=%22improve%22+label%3Aimprove

oleksiyskononenko · 2021-11-04T01:42:03Z

@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

src/datatable/xlsx.py

carrascomj · 2021-11-04T11:16:17Z

> @carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

Oops, I had not seen that issue nor the PR associated with it, sorry. As for the mentioned 10x performance decrease, it was caused by an edge case: the xlsx contained a link to another file (see pandas-dev/pandas#35029 (comment)).

With that said, the code in this PR is far from optimal. I will run some benchmarks and see how far openpyxl can go. However, I agree that a custom parser (as was suggested before) is more in line with the performance expected from datatable.

carrascomj · 2021-11-04T21:46:55Z

I tested the performance with master (4 columns, 200000 rows: int, float, datetime, str) and there is a 2x performance decrease, which is consistent with reported decreases in pandas.

The changes on 977af71 do not change performance but simplify the code.

---------------------------------------------------- benchmark: 2 tests ------------------------------------
Name (time in s)        Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------
openpyxl_this       19.3181  20.0605  19.6741  0.3141  19.7758  0.5201       2;0  0.0508       5           1
xlrd_upstream       10.6452  10.8720  10.7272  0.0898  10.7248  0.1125       1;0  0.0932       5           1
------------------------------------------------------------------------------------------------------------
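For reference, a rough sketch of how such timings can be collected and compared with the standard library alone. The helpers `time_rounds` and `slowdown` are hypothetical names and this is not the harness that produced the table; the two mean values are copied from the table above.

```python
import statistics
import time


def time_rounds(func, rounds=5):
    """Call `func` `rounds` times and summarize the wall-clock timings."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "max": max(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "rounds": rounds,
    }


def slowdown(candidate_mean, baseline_mean):
    """Return how many times slower the candidate is than the baseline."""
    return candidate_mean / baseline_mean


# Mean timings in seconds, copied from the benchmark table above.
OPENPYXL_MEAN = 19.6741
XLRD_MEAN = 10.7272

ratio = slowdown(OPENPYXL_MEAN, XLRD_MEAN)
print(f"openpyxl is {ratio:.2f}x slower than xlrd on this file")
```

With the reported means the ratio comes out at about 1.83, which is what the "2x" figure rounds up from.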

carrascomj · 2021-11-05T09:10:16Z

src/datatable/xlsx.py

+        if subpath in wb.sheetnames:
+            sheetname = subpath
+        else:
+            if "/" in subpath:


Appveyor is failing on Windows because of the conversion to a path with backslashes.

(1) C:\Path\to\file.xlsx\Sheet\A2:B9 fails (also for xls files with the current xlrd-based parser).

(2) C:\Path\to\file.xlsx\Sheet/A2:B9 also fails, since there is no sheet named C:\Path\to\file.xlsx\Sheet after splitting.

(3) C:\Path\to\file.xlsx/Sheet/A2:B9 works.

The intended behavior in Windows is to work with (1), right? I guess both xlsx.py and xls.py should check for backslashes if no (sheetname, range2d) pair was found when a subpath is specified.

oleksiyskononenko · 2021-11-09

Windows support was added after the Excel reading feature, so, you're right, we probably missed the backslash issue. My feeling is that on Windows we should support both types of slashes. What do you think?

carrascomj · 2021-11-09

Yes, I guess it would be more consistent for users if they do not have to change their code for different platforms. I'll give it a try in a day.
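One way to accept both separators could look like the sketch below. It is only an illustration under assumptions: `split_excel_subpath` is a hypothetical helper, not the function in xlsx.py or xls.py, and it does not handle sheet names that themselves contain a slash or directories whose names end in .xls/.xlsx.

```python
import re

# Matches the workbook part of the path, up to and including the extension,
# when it is followed by a separator or by the end of the string.
_EXCEL_FILE = re.compile(r"^(.*?\.xlsx?)(?=[/\\]|$)", re.IGNORECASE)


def split_excel_subpath(path):
    """Split `path` into (file, sheet, cell_range); missing parts are None."""
    match = _EXCEL_FILE.match(path)
    if match is None:
        raise ValueError(f"no .xls/.xlsx file found in {path!r}")
    file_part = match.group(1)
    rest = path[match.end():]
    # Treat forward slashes and backslashes as equivalent separators.
    parts = [p for p in re.split(r"[/\\]", rest) if p]
    if len(parts) > 2:
        raise ValueError(f"unexpected subpath {rest!r}")
    sheet = parts[0] if parts else None
    cell_range = parts[1] if len(parts) == 2 else None
    return file_part, sheet, cell_range
```

With this approach all three Windows paths listed earlier resolve to the same file, sheet and range, so user code would not need to change between platforms.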

oleksiyskononenko · 2022-01-04

This issue has been fixed as of #3220.

> I'll give it a try in a day.

Do you still want to run some benchmarks for this PR?

oleksiyskononenko · 2021-11-09T18:53:53Z

Thanks for testing @carrascomj. "2x performance decrease" seems quite significant...

carrascomj added 6 commits October 29, 2021 00:06

Upgrade to xlrd 2.0.0 + openpyxl

652503f

Fix starting column case

6e1e59a

Fix change of valid type in column for xlsx

57131c5

Test xlsx with workbooks created from openpyxl

44f3aa8

Handle subpaths and ranges correctly for xlsx

5c54dca

Deduplicate column names on xlsx test for appveyor

5e27892

samukweku requested a review from oleksiyskononenko November 1, 2021 04:46

samukweku assigned carrascomj Nov 1, 2021

samukweku changed the title ~~Upgrade to xlrd 2.0.0 + openpyxl~~ [ENH] Upgrade to xlrd 2.0.0 + openpyxl Nov 1, 2021

oleksiyskononenko added the improve Improvement of an existing functionality label Nov 1, 2021

samukweku changed the title ~~[ENH] Upgrade to xlrd 2.0.0 + openpyxl~~ Upgrade to xlrd 2.0.0 + openpyxl Nov 1, 2021

oleksiyskononenko reviewed Nov 4, 2021

fix: replace deprecated get_sheet_names on xlsx

d06d198

carrascomj added 3 commits November 4, 2021 22:37

fix: remove deprecated get_sheet_by_name on xlsx

16c659d

Add openpyxl on appveyor build

49796e4

Simplify read_xslx_sheetname

977af71
