augur curate with I/O options #1039

joverlee521 · 2022-09-09T23:47:28Z

Description of proposed changes

Adds the basic framework for augur curate that includes the simples subcommand normalize-strings and the shared I/O options for all subcommands. See commits for details.

Related issue(s)

Related to #860
Closes #998

Testing

Added unit tests for the new I/O functions and functional tests for augur curate normalize-strings and its I/O options.

Checklist

Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

codecov · 2022-09-09T23:58:34Z

Codecov Report

Base: 61.83% // Head: 63.32% // Increases project coverage by +1.48% 🎉

Coverage data is based on head (42f50e0) compared to base (7f968ce).
Patch coverage: 93.26% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1039      +/-   ##
==========================================
+ Coverage   61.83%   63.32%   +1.48%     
==========================================
  Files          52       57       +5     
  Lines        6331     6628     +297     
  Branches     1558     1632      +74     
==========================================
+ Hits         3915     4197     +282     
- Misses       2141     2147       +6     
- Partials      275      284       +9

Impacted Files	Coverage Δ
augur/__init__.py	`66.66% <ø> (ø)`
augur/argparse_.py	`79.16% <66.66%> (-3.45%)`	⬇️
augur/io/json.py	`85.00% <85.00%> (ø)`
augur/curate/__init__.py	`93.84% <93.84%> (ø)`
augur/io/metadata.py	`96.02% <95.58%> (-3.98%)`	⬇️
augur/curate/normalize_strings.py	`100.00% <100.00%> (ø)`
augur/curate/passthru.py	`100.00% <100.00%> (ø)`
augur/io/sequences.py	`100.00% <100.00%> (ø)`
augur/types.py	`100.00% <100.00%> (ø)`
augur/filter.py	`95.74% <0.00%> (+0.01%)`	⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

victorlin

Reviewed 862bbb6...516dcc6 today, gonna look at the rest next week!

augur/io/json.py

augur/__init__.py

tests/functional/curate/cram/normalize_strings.t

augur/utils.py

augur/io/metadata.py

augur/utils.py

joverlee521 · 2022-09-20T21:06:14Z

Seeing discussion on Slack about data parsing in fauna that starts with Excel files made me realize Excel files should probably be an accepted metadata input. Not sure if I'll add it to this PR, but did want to note it.

tsibley

Really nice work, @joverlee521! Comprehensive and well thought out and designed. I left a lot of little comments, but the bulk of it is 💯.

augur/io/json.py

augur/curate/__init__.py

augur/curate/normalize_strings.py

tests/functional/curate/cram/metadata-and-fasta-input.t

augur/io/sequences.py

augur/curate/__init__.py

augur/io/metadata.py

Add a new optional `command_attribute` parameter to allow command modules to be assigned to a different attribute name. The default value is `__command__` which allows top level Augur to directly run commands via `args.__command__.run()`. However, the new `augur curate` command will need to do some pre-processing of inputs before calling on its individual subcommands. This change allows us to assign the curate subcommand modules to a different attribute name that can be called within `curate.run()`.

Copied JSON handling functions¹ and related util functions² from seattleflu/id3c in anticipation that we will be using newline-delimited JSON records as default inputs and outputs of the new augur curate command. Copied unmodified for now so that we can see our modifications in version control history. ¹ https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/json.py ² https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/utils.py#L93-L174

Edit docstrings and doctests for JSONDecodeError to reflect the move from id3c to augur.

Refactor the awkward conditional in `load_ndjson` to explicitly skip lines that are empty when requested. Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

Add the basic framework for the new `augur curate` command that includes the `passthru` subcommand and default NDJSON input/output from stdin and stdout. Future work will add other subcommands. Note that curate subcommands will _not_ be run directly by top level Augur. This is designed this way to allow shared input options to be processed before running the subcommand. This ensures that each subcommand will not have to worry about I/O options. I'm adding the simplest subcommand to `augur curate` so that I can test the general framework of the command and the shared I/O parsers in subsequent commits.

Reads a CSV or TSV file and yields each row as a dictionary. Uses the csv standard library instead of pandas to avoid having to convert the pandas DataFrame to dicts. This is written as a general I/O function, but the intention is to use it to parse metadata inputs for the new `augur curate` command so that records can be easily streamed and modified by each subcommand.

The shared parser will be used to house the shared options for all subcommands. I have not found a "non-hacky" way to add parent parsers outside of the `add_parser` call, so each subcommand must add the shared parser to their own parser within their `register_parser` function. This includes the first shared option for curate commands, which is the metadata input file. The functional test for metadata input run the `passthru` subcommand, but it should cover all future subcommands because the inputs for all subcommands are processed the same way.

Add an enum of string values that represent how a data error should be handled. The values are currently just limited to "error_first", "error_all", "warn" and "silent", which are the common patterns of how we handle data errors in Augur. This was added with the intention to be used to dictate how the new `augur curate` command should handle duplicate records in the following commit.

We do not need to handle de-duplication in this input function, but we can at least report duplicate records if requested. If no id column is provided, it will use the first column of the input file to identify duplicates. The duplicate reporting methods include: * error_first = error on the first duplicate record encountered * error_all = run through all records and error with a message that includes all duplicate ids * warn = emits a warning on every encountered duplicate and prints a final warning message that is a summary of all duplicate ids * silent = ignore duplicates

It is not clear yet if we want to add deduplication as an `augur curate` command, but for now we can at least report duplicate records to the users. Commands will error on the first duplicate by default since we usually expect metadata to not contain duplicate records. Note that when "error_all" is requested, the command will still produce an output. This is because the error will only be raised after the generator has been exhausted. Since `curate` is chain of generators, the input generator will only be exhausted by writing the output. However, users should be able to see the error message if running the command manually and workflow managers such as Snakemake will delete the output if the command exits with an error.

Reads a metadata file with `read_table_to_dict` and updates each record dict with the corresponding sequence from a FASTA file. The FASTA file is read with `pyfastx.Fasta` to create an index for the file to allow for random access of sequences using their sequence id. To ensure that the sequences can be matched with their metadata, the FASTA headers must contain the matching sequence id. The FASTA headers may contain additional description parts after the id, but they will not be used in the matching process. `pyfastx` currently only works for plain and gzipped files, so it will limit the input formats for FASTA files. I think this is acceptable for now instead of building our own indexing library. We can look into extending `pyfastx` to support xz-compressed files in the future. Only complete records with both metadata and a sequence are processed, the rest are skipped.

The function now errors on the first unmatched record by default since we expect the provided metadata and FASTA files to have 1:1 records. However, if it is known that there may be unmatched records, then we can emit warnings to stderr or completely silence the unmatched messages.

The function now errors on records with duplicate sequence ids since we expect the provided metadata and FASTA files to have unique sequence ids. This can be toggled to warnings or silenced. The unmatched and duplicates reporting methods are separate parameters to allow users to fine tune which errors get reported. The reporting is structured to make sure the any warnings are emitted before the error is raised. If both unmatched and duplicate reporting request ERROR_ALL, then they are raised as a single error to give the user feedback on both.

It is not clear yet if we want to add deduplication as an `augur curate` command, but for now we can report duplicate records in both metadata and FASTA files to the users. Commands will error on the first duplicate or unmatched record by default since we usually expect unique records in the metadata and FASTA files that are 1:1. Note that if there are duplicate sequence ids in the FASTA file, pyfastx will always return the first sequence in the file. Note that the command can still produce an output even when an error is raised due to unmatched and duplicate records. However, users should be able to see the error message if running the command manually and workflow managers such as Snakemake will delete the output if the command exits with an error.

Writes an iterator of dict records to an output TSV file. This is written as a general I/O function, but the intention is to use it to write the metadata output for the new `augur curate` command.

Outputs metadata as TSV with the record keys as the column names. I did not provide options to include/exclude certain fields since the user can print TSV to stdout with '-' and pipe the output to tsv-utils or csvtk to select and reorder columns.

Writes sequences to a FASTA file by parsing the sequence id and sequence from an iterator that yields dict records. After writing the sequence to the FASTA file, the function yields a copy of the record without the sequence field so that the metadata can be consumed downstream. This is written as a general I/O function, but the intention is to use it to write the FASTA output for the new `augur curate` command before the records are output to a metadata file.

The FASTA output will remove the sequences from records before yielding them again, so the sequences will not be included in the metadata output. If not used in conjunction with the metadata output, the generator is exhausted to ensure that the command runs to completion.

Previously, `read_metadata_to_dict` only allowed files as inputs because it needed to reset the file position after taking a sample for the csv.Sniffer. This commit allows IO buffer inputs by converting the sample string back to a text stream that is then chained with the original text stream.

This will allow users to chain curate commands with other command such as tsv-utils or csvtk.

Try to be more helpful by adding a hint in error message of how to provide inputs to the command.

This list reflects the order of commands that are listed in the `augur --help` output. This commit moves `curate` towards the top of the list in anticipation that it will be commonly used.

joverlee521 · 2022-11-03T22:00:59Z

I plan to merge this tomorrow if there's no other feedback.

tsibley

🎉

tsibley · 2022-11-04T04:46:44Z

CHANGES.md

@@ -2,6 +2,11 @@

 ## __NEXT__

+### Features
+
+* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521)


A little jarring how a +1,921 -4 diff can be reduced to a single sentence! :-)

It wouldn't be unwarranted to say a bit more about augur curate here (goals, planned functionality, reason for existing, etc.) if you wanted.

Suggestion: add a file docs/usage/cli/curate.rst and link to the generated doc page from here.

Thanks both! I add docs and a bit more to the changelog.

I added the usage docs for augur curate with subdirectory that contains a separate page for each sub-subcommand since we are expecting to add more commands (see preview).

I also added more to the changelog to link to the new usage docs and also included more features that are all packaged into this PR.

(I didn't add API docs to avoid conflicts with #1087)

victorlin · 2022-11-04T17:27:20Z

CHANGES.md

@@ -2,6 +2,11 @@

 ## __NEXT__

+### Features
+
+* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521)


Suggestion: add a file docs/usage/cli/curate.rst and link to the generated doc page from here.

Add usage docs for the new augur curate subcommand as a subdirectory to separate the docs for each sub-subcommand because we are expecting to add more in the future. The I/O options are repeated for each sub-subcommand because they use a shared parser. I think of this as feature to show how each sub-subcommand can be a single command. Once we add more sub-subcommands, we can add examples of curate piplines to the main curate docs page. We can also add examples of curate interacting with external tools such as tsv-utils or csvtk.

CHANGES.md

@tsibley

Update the language to reflect that entries in the changelog should only be end user focused changes. Motived by @tsibley's comment¹ in a recent PR. It's nice for external users if our changelog is not bloated by internal development changes! ¹ #1039 (comment)

Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. As part of the augur curate suite of commands, `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <joverlee521@gmail.com>

Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <joverlee521@gmail.com>

Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>

joverlee521 requested a review from a team September 12, 2022 17:21

victorlin reviewed Sep 16, 2022

View reviewed changes

tsibley requested changes Sep 21, 2022

View reviewed changes

joverlee521 mentioned this pull request Oct 3, 2022

Pull data directly from COG-UK Data nextstrain/ncov-ingest#329

Open

joverlee521 and others added 10 commits October 26, 2022 16:46

io/json: Edit docstrings for JSONDecodeError

3cfbedd

Edit docstrings and doctests for JSONDecodeError to reflect the move from id3c to augur.

io/json: refactor load_ndjson

5732b73

Refactor the awkward conditional in `load_ndjson` to explicitly skip lines that are empty when requested. Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>

joverlee521 force-pushed the augur-curate branch from 682d3d3 to 1dd2a37 Compare October 27, 2022 01:23

joverlee521 added 13 commits October 26, 2022 18:25

io/metadata: Add write_records_to_tsv

599ba18

Writes an iterator of dict records to an output TSV file. This is written as a general I/O function, but the intention is to use it to write the metadata output for the new `augur curate` command.

curate: Add output metadata option

70a57f4

Outputs metadata as TSV with the record keys as the column names. I did not provide options to include/exclude certain fields since the user can print TSV to stdout with '-' and pipe the output to tsv-utils or csvtk to select and reorder columns.

curate: Accept '-' as stdin for metadata input

48e6d6e

This will allow users to chain curate commands with other command such as tsv-utils or csvtk.

curate: Update invalid input error message

f64070a

Try to be more helpful by adding a hint in error message of how to provide inputs to the command.

curate: Add normalize-strings subcommand

ee7c155

Move curate towards the top of subcommands list

caad3f5

This list reflects the order of commands that are listed in the `augur --help` output. This commit moves `curate` towards the top of the list in anticipation that it will be commonly used.

joverlee521 force-pushed the augur-curate branch from 3c633b4 to b34cc11 Compare November 3, 2022 19:54

tsibley approved these changes Nov 4, 2022

View reviewed changes

victorlin requested changes Nov 4, 2022

View reviewed changes

joverlee521 force-pushed the augur-curate branch from d5eeba9 to abee195 Compare November 4, 2022 20:20

joverlee521 force-pushed the augur-curate branch from abee195 to acca530 Compare November 4, 2022 20:33

tsibley reviewed Nov 4, 2022

View reviewed changes

CHANGES.md Outdated Show resolved Hide resolved

Update changelog

42f50e0

joverlee521 force-pushed the augur-curate branch from acca530 to 42f50e0 Compare November 4, 2022 22:28

joverlee521 mentioned this pull request Nov 4, 2022

Update PR template changelog checklist #1089

Merged

joverlee521 merged commit c067b79 into master Nov 4, 2022

joverlee521 deleted the augur-curate branch November 4, 2022 22:57

joverlee521 mentioned this pull request Nov 5, 2022

Expand augur parse #990

Closed

victorlin mentioned this pull request Nov 7, 2022

Replace FilterException with AugurError #1093

Merged

2 tasks

j23414 mentioned this pull request Apr 7, 2023

Curate titlecase #1197

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

augur curate with I/O options #1039

augur curate with I/O options #1039

joverlee521 commented Sep 9, 2022 •

edited

codecov bot commented Sep 9, 2022 •

edited

victorlin left a comment

joverlee521 commented Sep 20, 2022

tsibley left a comment

joverlee521 commented Nov 3, 2022

tsibley left a comment

tsibley Nov 4, 2022

victorlin Nov 4, 2022

joverlee521 Nov 4, 2022

joverlee521 Nov 4, 2022

joverlee521 Nov 4, 2022

victorlin Nov 4, 2022

augur curate with I/O options #1039

augur curate with I/O options #1039

Conversation

joverlee521 commented Sep 9, 2022 • edited

Description of proposed changes

Related issue(s)

Testing

Checklist

codecov bot commented Sep 9, 2022 • edited

Codecov Report

victorlin left a comment

Choose a reason for hiding this comment

joverlee521 commented Sep 20, 2022

tsibley left a comment

Choose a reason for hiding this comment

joverlee521 commented Nov 3, 2022

tsibley left a comment

Choose a reason for hiding this comment

tsibley Nov 4, 2022

Choose a reason for hiding this comment

victorlin Nov 4, 2022

Choose a reason for hiding this comment

joverlee521 Nov 4, 2022

Choose a reason for hiding this comment

joverlee521 Nov 4, 2022

Choose a reason for hiding this comment

joverlee521 Nov 4, 2022

Choose a reason for hiding this comment

victorlin Nov 4, 2022

Choose a reason for hiding this comment

joverlee521 commented Sep 9, 2022 •

edited

codecov bot commented Sep 9, 2022 •

edited