New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
augur curate with I/O options #1039
Conversation
Codecov ReportBase: 61.83% // Head: 63.32% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #1039 +/- ##
==========================================
+ Coverage 61.83% 63.32% +1.48%
==========================================
Files 52 57 +5
Lines 6331 6628 +297
Branches 1558 1632 +74
==========================================
+ Hits 3915 4197 +282
- Misses 2141 2147 +6
- Partials 275 284 +9
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 862bbb6...516dcc6 today, gonna look at the rest next week!
Seeing discussion on Slack about data parsing in fauna that starts with Excel files made me realize Excel files should probably be an accepted metadata input. Not sure if I'll add it to this PR, but did want to note it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice work, @joverlee521! Comprehensive and well thought out and designed. I left a lot of little comments, but the bulk of it is 💯.
Add a new optional `command_attribute` parameter to allow command modules to be assigned to a different attribute name. The default value is `__command__` which allows top level Augur to directly run commands via `args.__command__.run()`. However, the new `augur curate` command will need to do some pre-processing of inputs before calling on its individual subcommands. This change allows us to assign the curate subcommand modules to a different attribute name that can be called within `curate.run()`.
Copied JSON handling functions¹ and related util functions² from seattleflu/id3c in anticipation that we will be using newline-delimited JSON records as default inputs and outputs of the new augur curate command. Copied unmodified for now so that we can see our modifications in version control history. ¹ https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/json.py ² https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/utils.py#L93-L174
Edit docstrings and doctests for JSONDecodeError to reflect the move from id3c to augur.
Refactor the awkward conditional in `load_ndjson` to explicitly skip lines that are empty when requested. Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
Add the basic framework for the new `augur curate` command that includes the `passthru` subcommand and default NDJSON input/output from stdin and stdout. Future work will add other subcommands. Note that curate subcommands will _not_ be run directly by top level Augur. This is designed this way to allow shared input options to be processed before running the subcommand. This ensures that each subcommand will not have to worry about I/O options. I'm adding the simplest subcommand to `augur curate` so that I can test the general framework of the command and the shared I/O parsers in subsequent commits.
Reads a CSV or TSV file and yields each row as a dictionary. Uses the csv standard library instead of pandas to avoid having to convert the pandas DataFrame to dicts. This is written as a general I/O function, but the intention is to use it to parse metadata inputs for the new `augur curate` command so that records can be easily streamed and modified by each subcommand.
The shared parser will be used to house the shared options for all subcommands. I have not found a "non-hacky" way to add parent parsers outside of the `add_parser` call, so each subcommand must add the shared parser to their own parser within their `register_parser` function. This includes the first shared option for curate commands, which is the metadata input file. The functional test for metadata input run the `passthru` subcommand, but it should cover all future subcommands because the inputs for all subcommands are processed the same way.
Add an enum of string values that represent how a data error should be handled. The values are currently just limited to "error_first", "error_all", "warn" and "silent", which are the common patterns of how we handle data errors in Augur. This was added with the intention to be used to dictate how the new `augur curate` command should handle duplicate records in the following commit.
We do not need to handle de-duplication in this input function, but we can at least report duplicate records if requested. If no id column is provided, it will use the first column of the input file to identify duplicates. The duplicate reporting methods include: * error_first = error on the first duplicate record encountered * error_all = run through all records and error with a message that includes all duplicate ids * warn = emits a warning on every encountered duplicate and prints a final warning message that is a summary of all duplicate ids * silent = ignore duplicates
It is not clear yet if we want to add deduplication as an `augur curate` command, but for now we can at least report duplicate records to the users. Commands will error on the first duplicate by default since we usually expect metadata to not contain duplicate records. Note that when "error_all" is requested, the command will still produce an output. This is because the error will only be raised after the generator has been exhausted. Since `curate` is chain of generators, the input generator will only be exhausted by writing the output. However, users should be able to see the error message if running the command manually and workflow managers such as Snakemake will delete the output if the command exits with an error.
682d3d3
to
1dd2a37
Compare
Reads a metadata file with `read_table_to_dict` and updates each record dict with the corresponding sequence from a FASTA file. The FASTA file is read with `pyfastx.Fasta` to create an index for the file to allow for random access of sequences using their sequence id. To ensure that the sequences can be matched with their metadata, the FASTA headers must contain the matching sequence id. The FASTA headers may contain additional description parts after the id, but they will not be used in the matching process. `pyfastx` currently only works for plain and gzipped files, so it will limit the input formats for FASTA files. I think this is acceptable for now instead of building our own indexing library. We can look into extending `pyfastx` to support xz-compressed files in the future. Only complete records with both metadata and a sequence are processed, the rest are skipped.
The function now errors on the first unmatched record by default since we expect the provided metadata and FASTA files to have 1:1 records. However, if it is known that there may be unmatched records, then we can emit warnings to stderr or completely silence the unmatched messages.
The function now errors on records with duplicate sequence ids since we expect the provided metadata and FASTA files to have unique sequence ids. This can be toggled to warnings or silenced. The unmatched and duplicates reporting methods are separate parameters to allow users to fine tune which errors get reported. The reporting is structured to make sure the any warnings are emitted before the error is raised. If both unmatched and duplicate reporting request ERROR_ALL, then they are raised as a single error to give the user feedback on both.
It is not clear yet if we want to add deduplication as an `augur curate` command, but for now we can report duplicate records in both metadata and FASTA files to the users. Commands will error on the first duplicate or unmatched record by default since we usually expect unique records in the metadata and FASTA files that are 1:1. Note that if there are duplicate sequence ids in the FASTA file, pyfastx will always return the first sequence in the file. Note that the command can still produce an output even when an error is raised due to unmatched and duplicate records. However, users should be able to see the error message if running the command manually and workflow managers such as Snakemake will delete the output if the command exits with an error.
Writes an iterator of dict records to an output TSV file. This is written as a general I/O function, but the intention is to use it to write the metadata output for the new `augur curate` command.
Outputs metadata as TSV with the record keys as the column names. I did not provide options to include/exclude certain fields since the user can print TSV to stdout with '-' and pipe the output to tsv-utils or csvtk to select and reorder columns.
Writes sequences to a FASTA file by parsing the sequence id and sequence from an iterator that yields dict records. After writing the sequence to the FASTA file, the function yields a copy of the record without the sequence field so that the metadata can be consumed downstream. This is written as a general I/O function, but the intention is to use it to write the FASTA output for the new `augur curate` command before the records are output to a metadata file.
The FASTA output will remove the sequences from records before yielding them again, so the sequences will not be included in the metadata output. If not used in conjunction with the metadata output, the generator is exhausted to ensure that the command runs to completion.
Previously, `read_metadata_to_dict` only allowed files as inputs because it needed to reset the file position after taking a sample for the csv.Sniffer. This commit allows IO buffer inputs by converting the sample string back to a text stream that is then chained with the original text stream.
This will allow users to chain curate commands with other command such as tsv-utils or csvtk.
Try to be more helpful by adding a hint in error message of how to provide inputs to the command.
This list reflects the order of commands that are listed in the `augur --help` output. This commit moves `curate` towards the top of the list in anticipation that it will be commonly used.
3c633b4
to
b34cc11
Compare
I plan to merge this tomorrow if there's no other feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
CHANGES.md
Outdated
@@ -2,6 +2,11 @@ | |||
|
|||
## __NEXT__ | |||
|
|||
### Features | |||
|
|||
* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little jarring how a +1,921 -4 diff can be reduced to a single sentence! :-)
It wouldn't be unwarranted to say a bit more about augur curate
here (goals, planned functionality, reason for existing, etc.) if you wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: add a file docs/usage/cli/curate.rst
and link to the generated doc page from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both! I add docs and a bit more to the changelog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the usage docs for augur curate with subdirectory that contains a separate page for each sub-subcommand since we are expecting to add more commands (see preview).
I also added more to the changelog to link to the new usage docs and also included more features that are all packaged into this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I didn't add API docs to avoid conflicts with #1087)
CHANGES.md
Outdated
@@ -2,6 +2,11 @@ | |||
|
|||
## __NEXT__ | |||
|
|||
### Features | |||
|
|||
* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: add a file docs/usage/cli/curate.rst
and link to the generated doc page from here.
d5eeba9
to
abee195
Compare
Add usage docs for the new augur curate subcommand as a subdirectory to separate the docs for each sub-subcommand because we are expecting to add more in the future. The I/O options are repeated for each sub-subcommand because they use a shared parser. I think of this as feature to show how each sub-subcommand can be a single command. Once we add more sub-subcommands, we can add examples of curate piplines to the main curate docs page. We can also add examples of curate interacting with external tools such as tsv-utils or csvtk.
abee195
to
acca530
Compare
acca530
to
42f50e0
Compare
Update the language to reflect that entries in the changelog should only be end user focused changes. Motived by @tsibley's comment¹ in a recent PR. It's nice for external users if our changelog is not bloated by internal development changes! ¹ #1039 (comment)
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. As part of the augur curate suite of commands, `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields script in the monkeypox repo. The `augur curate normalize` sub-command has already been added based on the same script (#1039). Overall this is part of filling in the gaps in the augur curate suite of commands (#860), specifically addressing issue (#999), and is a follow-up to #1039. `augur curate titlecase` would transform the values of a given metadata field to titlecase. This is useful for normalizing the values of a string that may contain inconsistent capitalization such as "North America" and "north america". This commit also adds a test for the new sub-command and updates the documentation. For testing an upper case to lower case circumflex'd o character conversion, had to use the escaped unicode character Co-authored-by: Jover Lee <joverlee521@gmail.com>
Description of proposed changes
Adds the basic framework for
augur curate
that includes the simples subcommandnormalize-strings
and the shared I/O options for all subcommands. See commits for details.Related issue(s)
Related to #860
Closes #998
Testing
Added unit tests for the new I/O functions and functional tests for
augur curate normalize-strings
and its I/O options.Checklist