-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
curate: Accept Excel and OpenOffice metadata files #1550
Conversation
4941443
to
cdae769
Compare
These tests have nothing directly to do with stdin per se: they're testing whether read_table_to_dict() accepts a handle or not. There's no need to override sys.stdin, as it's not used within read_table_to_dict() or the nested call to open_file(). Note that open_file("-") _will_ use sys.stdin and read_table_to_dict(x) calls open_file(x), so you might think that we do sometimes use sys.stdin and should instead (or additionally) actually test that. However, our only real calls to read_table_to_dict() originate from augur/curate/__init__.py, which translates a filename of "-" to sys.stdin itself before open_file() ever sees it.
Reading Excel spreadsheets directly will be helpful for users (and ourselves!) to avoid repeated manual conversions and ad-hoc Excel → TSV programs. OpenOffice supports comes along for the ride with Calamine, the Rust library we're using (via unofficial Python bindings) to read the files. Only the first sheet in the workbook is read, and how it's read is not configurable. In review of previous Excel files we've used, this is sufficient functionality. We can extend support across alternate sheets or multiple sheets or parts of sheets in the future, if needed. Using Calamine means that we can support .xls and .xlsx with a single library instead of using both xlrd and openpyxl. The latter is certainly doable—I've done it that way many times in the past; it's also what Pandas does—and our needs a pretty basic so they'd suffice, but still, a zero deps, single library, Rust-underneath solution is nice, and I was pleased to find it. If we find out Calamine is lacking in some way, we can still fallback to the tried and true xlrd + openpyxl combo. Resolves: <#1487>
cdae769
to
6eff3e4
Compare
Fixed flake8 linting error (unused import in test file) and added changelog entry. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1550 +/- ##
==========================================
+ Coverage 70.13% 70.22% +0.08%
==========================================
Files 74 74
Lines 7922 7952 +30
Branches 1938 1945 +7
==========================================
+ Hits 5556 5584 +28
- Misses 2081 2082 +1
- Partials 285 286 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks & works great for simple Excel files!
Commented with small nit and future improvement, but nothing blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works with the excel files I have from various collaborations
Sheets may be hidden (or even "very hidden"), and we'll skip those. Sheets may also not be worksheets but be chart sheets, macro sheets, and more, so we'll skip those as well. Hidden sheets seem more likely to crop up in practice, but we're filtering anyway so it's not much more effort to be precise about the sheet type too. Additionally, clarify in documentation (and code) the behaviour of skipping initial empty rows/columns in the sheet, as @joverlee521 made the good suggestion to do in review.
These object types may be produced by our Excel/OpenOffice spreadsheet reader, Calamine, and we want to be able to serialize them without error. I missed adding these in "io.json: Serialize date objects as ISO 8601 too" (e20f920).
60fed44
to
981c740
Compare
Reading Excel spreadsheets directly will be helpful for users (and ourselves!) to avoid repeated manual conversions and ad-hoc Excel → TSV programs. OpenOffice supports comes along for the ride with Calamine, the Rust library we're using (via unofficial Python bindings) to read the files.
Only the first sheet in the workbook is read, and how it's read is not configurable. In review of previous Excel files we've used, this is sufficient functionality. We can extend support across alternate sheets or multiple sheets or parts of sheets in the future, if needed.
Using Calamine means that we can support .xls and .xlsx with a single library instead of using both xlrd and openpyxl. The latter is certainly doable—I've done it that way many times in the past; it's also what Pandas does—and our needs a pretty basic so they'd suffice, but still, a zero deps, single library, Rust-underneath solution is nice, and I was pleased to find it. If we find out Calamine is lacking in some way, we can still fallback to the tried and true xlrd + openpyxl combo.
Resolves: #1487
Checklist