Upgrade to xlrd 2.0.0 + openpyxl #3191
Open: carrascomj wants to merge 10 commits into h2oai:main from carrascomj:fix-xlsx (+175 −13)
Commits:
- 652503f Upgrade to xlrd 2.0.0 + openpyxl (carrascomj)
- 6e1e59a Fix starting column case (carrascomj)
- 57131c5 Fix change of valid type in column for xlsx (carrascomj)
- 44f3aa8 Test xlsx with workbooks created from openpyxl (carrascomj)
- 5c54dca Handle subpaths and ranges correctly for xlsx (carrascomj)
- 5e27892 Deduplicate column names on xlsx test for appveyor (carrascomj)
- d06d198 fix: replace deprecated get_sheet_names on xlsx (carrascomj)
- 16c659d fix: remove deprecated get_sheet_by_name on xlsx (carrascomj)
- 49796e4 Add openpyxl on appveyor build (carrascomj)
- 977af71 Simplify read_xslx_sheetname (carrascomj)
@@ -1,4 +1,5 @@
 numpy
 pandas
 pyarrow
-xlrd<=1.2.0
+xlrd>=2.0.0
+openpyxl
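Starting with 2.0.0, xlrd reads only the legacy .xls format, which is why this change adds openpyxl for .xlsx files. The sketch below illustrates that split by routing on the file extension. `choose_excel_reader` and the `READERS` table are hypothetical names used only for illustration; the PR does not show how datatable dispatches between the two parsers.

```python
import os

# Hypothetical mapping, not taken from the PR: which package is expected
# to parse each extension once xlrd is pinned to >=2.0.0.
READERS = {".xls": "xlrd", ".xlsx": "openpyxl"}


def choose_excel_reader(filename):
    """Return the name of the package expected to read `filename`."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return READERS[ext]
    except KeyError:
        raise ValueError("Unsupported Excel extension: %r" % ext)
```

Matching on the lowercased extension keeps the check independent of how the file name is capitalized.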
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+#-------------------------------------------------------------------------------
+# Copyright 2018-2021 H2O.ai
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#-------------------------------------------------------------------------------
+import datatable as dt
+
+
+
+def read_xlsx_workbook(filename, subpath):
+    try:
+        import openpyxl
+    except ImportError:
+        raise dt.exceptions.ImportError(
+            "Module `openpyxl` is required in order to read Excel file '%s'"
+            % filename)
+
+    if subpath:
+        wb = openpyxl.load_workbook(filename, data_only=True)
+        sheetname = range2d = None  # avoids a NameError when nothing matches
+        if subpath in wb.sheetnames:
+            sheetname = subpath
+        else:
+            if "/" in subpath:
+                sheetname, xlsrange = subpath.rsplit('/', 1)
+                range2d = xlsrange
+            if not(sheetname in wb.sheetnames and range2d is not None):
+                raise ValueError("Sheet `%s` is not found in the XLS file"
+                                 % subpath)
+        ws = wb[sheetname]
+        result = read_xlsx_worksheet(ws, range2d)
+    else:
+        wb = openpyxl.load_workbook(filename, data_only=True)
+        result = {}
+        for name, ws in zip(wb.sheetnames, wb):
+            out = read_xlsx_worksheet(ws)
+            if out is None:
+                continue
+            for i, frame in out.items():
+                result["%s/%s/%s" % (filename, name, i)] = frame
+
+    if len(result) == 0:
+        return None
+    elif len(result) == 1:
+        for v in result.values():
+            return v
+    else:
+        return result
+
+
+
+def read_xlsx_worksheet(ws, subrange=None):
+    if subrange is None:
+        ranges2d = [ws.calculate_dimension()]
+    else:
+        ranges2d = [subrange]
+
+    results = {}
+    for range2d in ranges2d:
+        subview = ws[range2d]
+        # the subview is a tuple of rows, which are tuples of columns
+        ncols = len(subview[0])
+        colnames = [str(n.value) for n in subview[0]]
+
+        outdata = [
+            [
+                row[i]._value if row[i].data_type != "e" else None
+                for row in subview[1:]
+            ]
+            for i in range(ncols)
+        ]
+        coltypes = [
+            str if any(isinstance(cell, str) for cell in col) else None
+            for col in outdata
+        ]
+
+        # let the frame decide on the types of non-str cols
+        types = [str if coltype == str else None for coltype in coltypes]
+        frame = dt.Frame(outdata, names=colnames, types=types)
+        results[range2d] = frame
+    return results
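The worksheet reader above turns row-major cells into per-column lists and forces the string type only for columns that contain at least one string. A minimal sketch of that step on plain Python tuples, without openpyxl or datatable, is shown below. `columns_from_rows` is a hypothetical helper written for illustration, and it skips the error-cell handling that the diff performs.

```python
def columns_from_rows(rows):
    """Split a rectangular block of values into names, columns and types.

    `rows` is a sequence of rows whose first row holds the column names,
    mirroring how the worksheet reader treats the first row of a range.
    """
    header, body = rows[0], rows[1:]
    names = [str(name) for name in header]
    # Transpose the row-major values into one list per column.
    columns = [[row[i] for row in body] for i in range(len(header))]
    # Force str only where at least one value is a string; None means
    # "leave the type to whatever consumes the columns".
    types = [
        str if any(isinstance(value, str) for value in col) else None
        for col in columns
    ]
    return names, columns, types


rows = [
    ("id", "label", "score"),
    (1, "a", 0.5),
    (2, 7, None),
]
names, columns, types = columns_from_rows(rows)
```

Here the "label" column mixes a string and an integer, so it is the only one marked as `str`; the other two are left as `None`.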
Appveyor is failing on Windows because of the conversion to a path with backslashes.
1. C:\Path\to\file.xlsx\Sheet\A2:B9 fails (also for xls files with the current xlrd-based parser).
2. C:\Path\to\file.xlsx\Sheet/A2:B9 also fails, since there is no such C:\Path\to\file.xlsx\Sheet sheet after splitting.
3. C:\Path\to\file.xlsx/Sheet/A2:B9 works.

The intended behavior on Windows is to work with (1), right? I guess both xlsx.py and xls.py should check for backslashes if no (sheetname, range2d) pair was found when a subpath is specified.
Windows support was added after the Excel reading feature, so you're right, we probably missed the backslash issue. My feeling is that on Windows we should support both types of slashes. What do you think?
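One way to accept both kinds of slashes is sketched below, purely as an illustration and not as the fix that was eventually merged. `split_excel_path` is a hypothetical helper. It assumes the workbook part of the path ends in .xls or .xlsx and that no parent directory has such a suffix.

```python
import re

# Assumption for this sketch: the workbook name ends with ".xls" or ".xlsx"
# and is optionally followed by a slash of either kind and a subpath.
_EXCEL_SPLIT = re.compile(r"^(.*?\.xlsx?)(?:[\\/](.*))?$", re.IGNORECASE)


def split_excel_path(path):
    """Split `path` into (filename, sheetname, range2d).

    Parts that are absent from the path are returned as None.
    """
    match = _EXCEL_SPLIT.match(path)
    if match is None:
        raise ValueError("No .xls/.xlsx file found in path %r" % path)
    filename, subpath = match.group(1), match.group(2)
    if not subpath:
        return filename, None, None
    # Cut at the last slash of either kind to separate sheet and range.
    cut = max(subpath.rfind("/"), subpath.rfind("\\"))
    if cut == -1:
        return filename, subpath, None
    return filename, subpath[:cut], subpath[cut + 1:]
```

With this helper the three spellings listed in the comment above resolve to the same (filename, sheetname, range2d) triple, so user code would not need to change between platforms.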
Yes, I guess it would be more consistent for the user if they do not have to change the code for different platforms. I'll give it a try in a day.
This issue has been fixed as of #3220.
Do you still want to run some benchmarks for this PR?