Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curate: Accept Excel and OpenOffice metadata files #1550

Merged
merged 4 commits into from
Jul 22, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Jul 18, 2024

Reading Excel spreadsheets directly will be helpful for users (and ourselves!) to avoid repeated manual conversions and ad-hoc Excel → TSV programs. OpenOffice supports comes along for the ride with Calamine, the Rust library we're using (via unofficial Python bindings) to read the files.

Only the first sheet in the workbook is read, and how it's read is not configurable. In review of previous Excel files we've used, this is sufficient functionality. We can extend support across alternate sheets or multiple sheets or parts of sheets in the future, if needed.

Using Calamine means that we can support .xls and .xlsx with a single library instead of using both xlrd and openpyxl. The latter is certainly doable—I've done it that way many times in the past; it's also what Pandas does—and our needs a pretty basic so they'd suffice, but still, a zero deps, single library, Rust-underneath solution is nice, and I was pleased to find it. If we find out Calamine is lacking in some way, we can still fallback to the tried and true xlrd + openpyxl combo.

Resolves: #1487

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

@tsibley tsibley requested review from jameshadfield, joverlee521 and a team July 18, 2024 05:45
@tsibley tsibley changed the base branch from master to trs/json-serialize-dates July 18, 2024 05:45
Base automatically changed from trs/json-serialize-dates to master July 18, 2024 06:00
These tests have nothing directly to do with stdin per se: they're
testing whether read_table_to_dict() accepts a handle or not.  There's
no need to override sys.stdin, as it's not used within
read_table_to_dict() or the nested call to open_file().

Note that open_file("-") _will_ use sys.stdin and read_table_to_dict(x)
calls open_file(x), so you might think that we do sometimes use
sys.stdin and should instead (or additionally) actually test that.
However, our only real calls to read_table_to_dict() originate from
augur/curate/__init__.py, which translates a filename of "-" to
sys.stdin itself before open_file() ever sees it.
Reading Excel spreadsheets directly will be helpful for users (and
ourselves!) to avoid repeated manual conversions and ad-hoc Excel → TSV
programs.  OpenOffice supports comes along for the ride with Calamine,
the Rust library we're using (via unofficial Python bindings) to read
the files.

Only the first sheet in the workbook is read, and how it's read is not
configurable.  In review of previous Excel files we've used, this is
sufficient functionality.  We can extend support across alternate sheets
or multiple sheets or parts of sheets in the future, if needed.

Using Calamine means that we can support .xls and .xlsx with a single
library instead of using both xlrd and openpyxl.  The latter is
certainly doable—I've done it that way many times in the past; it's also
what Pandas does—and our needs a pretty basic so they'd suffice, but
still, a zero deps, single library, Rust-underneath solution is nice,
and I was pleased to find it.  If we find out Calamine is lacking in
some way, we can still fallback to the tried and true xlrd + openpyxl
combo.

Resolves: <#1487>
@tsibley
Copy link
Member Author

tsibley commented Jul 18, 2024

Fixed flake8 linting error (unused import in test file) and added changelog entry.

Copy link

codecov bot commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 95.83333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.22%. Comparing base (556aa28) to head (981c740).

Files Patch % Lines
augur/io/metadata.py 95.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1550      +/-   ##
==========================================
+ Coverage   70.13%   70.22%   +0.08%     
==========================================
  Files          74       74              
  Lines        7922     7952      +30     
  Branches     1938     1945       +7     
==========================================
+ Hits         5556     5584      +28     
- Misses       2081     2082       +1     
- Partials      285      286       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tsibley tsibley added the enhancement New feature or request label Jul 18, 2024
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks & works great for simple Excel files!
Commented with small nit and future improvement, but nothing blocking.

augur/io/metadata.py Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
augur/io/metadata.py Show resolved Hide resolved
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works with the excel files I have from various collaborations

Sheets may be hidden (or even "very hidden"), and we'll skip those.
Sheets may also not be worksheets but be chart sheets, macro sheets, and
more, so we'll skip those as well.  Hidden sheets seem more likely to
crop up in practice, but we're filtering anyway so it's not much more
effort to be precise about the sheet type too.

Additionally, clarify in documentation (and code) the behaviour of
skipping initial empty rows/columns in the sheet, as @joverlee521 made
the good suggestion to do in review.
@tsibley tsibley requested a review from joverlee521 July 19, 2024 06:35
These object types may be produced by our Excel/OpenOffice spreadsheet
reader, Calamine, and we want to be able to serialize them without
error.

I missed adding these in "io.json: Serialize date objects as ISO 8601
too" (e20f920).
@tsibley tsibley merged commit be94e50 into master Jul 22, 2024
23 checks passed
@tsibley tsibley deleted the trs/curate/excel-input branch July 22, 2024 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

augur curate: Support Excel inputs
6 participants