Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur curate with I/O options #1039

Merged
merged 28 commits into from Nov 4, 2022
Merged

augur curate with I/O options #1039

merged 28 commits into from Nov 4, 2022

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Sep 9, 2022

Description of proposed changes

Adds the basic framework for augur curate that includes the simples subcommand normalize-strings and the shared I/O options for all subcommands. See commits for details.

Related issue(s)

Related to #860
Closes #998

Testing

Added unit tests for the new I/O functions and functional tests for augur curate normalize-strings and its I/O options.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

@codecov
Copy link

codecov bot commented Sep 9, 2022

Codecov Report

Base: 61.83% // Head: 63.32% // Increases project coverage by +1.48% 🎉

Coverage data is based on head (42f50e0) compared to base (7f968ce).
Patch coverage: 93.26% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1039      +/-   ##
==========================================
+ Coverage   61.83%   63.32%   +1.48%     
==========================================
  Files          52       57       +5     
  Lines        6331     6628     +297     
  Branches     1558     1632      +74     
==========================================
+ Hits         3915     4197     +282     
- Misses       2141     2147       +6     
- Partials      275      284       +9     
Impacted Files Coverage Δ
augur/__init__.py 66.66% <ø> (ø)
augur/argparse_.py 79.16% <66.66%> (-3.45%) ⬇️
augur/io/json.py 85.00% <85.00%> (ø)
augur/curate/__init__.py 93.84% <93.84%> (ø)
augur/io/metadata.py 96.02% <95.58%> (-3.98%) ⬇️
augur/curate/normalize_strings.py 100.00% <100.00%> (ø)
augur/curate/passthru.py 100.00% <100.00%> (ø)
augur/io/sequences.py 100.00% <100.00%> (ø)
augur/types.py 100.00% <100.00%> (ø)
augur/filter.py 95.74% <0.00%> (+0.01%) ⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@joverlee521 joverlee521 requested a review from a team September 12, 2022 17:21
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 862bbb6...516dcc6 today, gonna look at the rest next week!

augur/io/json.py Outdated Show resolved Hide resolved
augur/__init__.py Outdated Show resolved Hide resolved
tests/functional/curate/cram/normalize_strings.t Outdated Show resolved Hide resolved
augur/utils.py Outdated Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
augur/io/metadata.py Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
augur/utils.py Outdated Show resolved Hide resolved
@joverlee521
Copy link
Contributor Author

Seeing discussion on Slack about data parsing in fauna that starts with Excel files made me realize Excel files should probably be an accepted metadata input. Not sure if I'll add it to this PR, but did want to note it.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice work, @joverlee521! Comprehensive and well thought out and designed. I left a lot of little comments, but the bulk of it is 💯.

augur/io/json.py Outdated Show resolved Hide resolved
augur/curate/__init__.py Show resolved Hide resolved
augur/curate/__init__.py Outdated Show resolved Hide resolved
augur/curate/normalize_strings.py Show resolved Hide resolved
augur/curate/normalize_strings.py Outdated Show resolved Hide resolved
tests/functional/curate/cram/metadata-and-fasta-input.t Outdated Show resolved Hide resolved
augur/io/sequences.py Outdated Show resolved Hide resolved
augur/curate/__init__.py Outdated Show resolved Hide resolved
augur/curate/__init__.py Outdated Show resolved Hide resolved
augur/io/metadata.py Outdated Show resolved Hide resolved
joverlee521 and others added 10 commits October 26, 2022 16:46
Add a new optional `command_attribute` parameter to allow command
modules to be assigned to a different attribute name. The default value
is `__command__` which allows top level Augur to directly run commands
via `args.__command__.run()`. However, the new `augur curate` command
will need to do some pre-processing of inputs before calling on its
individual subcommands. This change allows us to assign the curate
subcommand modules to a different attribute name that can be called
within `curate.run()`.
Copied JSON handling functions¹ and related util functions² from
seattleflu/id3c in anticipation that we will be using newline-delimited
JSON records as default inputs and outputs of the new augur curate
command. Copied unmodified for now so that we can see our modifications
in version control history.

¹ https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/json.py
² https://github.com/seattleflu/id3c/blob/bf3a60e0b5eedb55e19924901caa6f21be2230ab/lib/id3c/utils.py#L93-L174
Edit docstrings and doctests for JSONDecodeError to reflect the move
from id3c to augur.
Refactor the awkward conditional in `load_ndjson` to explicitly
skip lines that are empty when requested.

Co-authored-by: Thomas Sibley <tsibley@fredhutch.org>
Add the basic framework for the new `augur curate` command that includes
the `passthru` subcommand and default NDJSON input/output
from stdin and stdout. Future work will add other subcommands.

Note that curate subcommands will _not_ be run directly by top level
Augur. This is designed this way to allow shared input options to be
processed before running the subcommand. This ensures that each
subcommand will not have to worry about I/O options.

I'm adding the simplest subcommand to `augur curate` so that I can test
the general framework of the command and the shared I/O parsers in
subsequent commits.
Reads a CSV or TSV file and yields each row as a dictionary.
Uses the csv standard library instead of pandas to avoid having to
convert the pandas DataFrame to dicts. This is written as a general
I/O function, but the intention is to use it to parse metadata inputs
for the new `augur curate` command so that records can be easily
streamed and modified by each subcommand.
The shared parser will be used to house the shared options for
all subcommands. I have not found a "non-hacky" way to add parent
parsers outside of the `add_parser` call, so each subcommand must add
the shared parser to their own parser within their `register_parser`
function.

This includes the first shared option for curate commands, which is
the metadata input file. The functional test for metadata input run
the `passthru` subcommand, but it should cover all future subcommands
because the inputs for all subcommands are processed the same way.
Add an enum of string values that represent how a data error should be
handled. The values are currently just limited to "error_first",
"error_all", "warn" and "silent", which are the common patterns of how
we handle data errors in Augur.

This was added with the intention to be used to dictate how the new
`augur curate` command should handle duplicate records in the following
commit.
We do not need to handle de-duplication in this input function, but
we can at least report duplicate records if requested. If no id column
is provided, it will use the first column of the input file to identify
duplicates.

The duplicate reporting methods include:
* error_first = error on the first duplicate record encountered
* error_all = run through all records and error with a message that
includes all duplicate ids
* warn = emits a warning on every encountered duplicate and prints a
final warning message that is a summary of all duplicate ids
* silent = ignore duplicates
It is not clear yet if we want to add deduplication as an `augur curate`
command, but for now we can at least report duplicate records to the
users. Commands will error on the first duplicate by default since we
usually expect metadata to not contain duplicate records.

Note that when "error_all" is requested, the command will still produce
an output. This is because the error will only be raised after the
generator has been exhausted. Since `curate` is chain of generators, the
input generator will only be exhausted by writing the output. However,
users should be able to see the error message if running the command
manually and workflow managers such as Snakemake will delete the output
if the command exits with an error.
Reads a metadata file with `read_table_to_dict` and updates each
record dict with the corresponding sequence from a FASTA file. The FASTA
file is read with `pyfastx.Fasta` to create an index for the file to
allow for random access of sequences using their sequence id. To ensure
that the sequences can be matched with their metadata, the FASTA headers
must contain the matching sequence id. The FASTA headers may contain
additional description parts after the id, but they will not be used in
the matching process.

`pyfastx` currently only works for plain and gzipped files, so it will
limit the input formats for FASTA files. I think this is acceptable
for now instead of building our own indexing library. We can look into
extending `pyfastx` to support xz-compressed files in the future.

Only complete records with both metadata and a sequence are processed,
the rest are skipped.
The function now errors on the first unmatched record by default since
we expect the provided metadata and FASTA files to have 1:1 records.
However, if it is known that there may be unmatched records, then we can
emit warnings to stderr or completely silence the unmatched messages.
The function now errors on records with duplicate sequence ids since we
expect the provided metadata and FASTA files to have unique sequence ids.
This can be toggled to warnings or silenced.

The unmatched and duplicates reporting methods are separate parameters
to allow users to fine tune which errors get reported. The reporting is
structured to make sure the any warnings are emitted before the error
is raised. If both unmatched and duplicate reporting request ERROR_ALL,
then they are raised as a single error to give the user feedback on both.
It is not clear yet if we want to add deduplication as an `augur curate`
command, but for now we can report duplicate records in both metadata
and FASTA files to the users. Commands will error on the first duplicate
or unmatched record by default since we usually expect unique records in
the metadata and FASTA files that are 1:1.

Note that if there are duplicate sequence ids in the FASTA file, pyfastx
will always return the first sequence in the file.

Note that the command can still produce an output even when an error is
raised due to unmatched and duplicate records. However, users should be
able to see the error message if running the command manually and
workflow managers such as Snakemake will delete the output if the
command exits with an error.
Writes an iterator of dict records to an output TSV file.
This is written as a general I/O function, but the intention is to use
it to write the metadata output for the new `augur curate` command.
Outputs metadata as TSV with the record keys as the column names. I did
not provide options to include/exclude certain fields since the user
can print TSV to stdout with '-' and pipe the output to tsv-utils or
csvtk to select and reorder columns.
Writes sequences to a FASTA file by parsing the sequence id and sequence
from an iterator that yields dict records. After writing the sequence
to the FASTA file, the function yields a copy of the record without the
sequence field so that the metadata can be consumed downstream.

This is written as a general I/O function, but the intention is to use
it to write the FASTA output for the new `augur curate` command before
the records are output to a metadata file.
The FASTA output will remove the sequences from records before yielding
them again, so the sequences will not be included in the metadata
output. If not used in conjunction with the metadata output, the
generator is exhausted to ensure that the command runs to completion.
Previously, `read_metadata_to_dict` only allowed files as inputs because
it needed to reset the file position after taking a sample for the
csv.Sniffer.

This commit allows IO buffer inputs by converting the sample string back
to a text stream that is then chained with the original text stream.
This will allow users to chain curate commands with other command such
as tsv-utils or csvtk.
Try to be more helpful by adding a hint in error message of how to
provide inputs to the command.
This list reflects the order of commands that are listed in the
`augur --help` output. This commit moves `curate` towards the top of
the list in anticipation that it will be commonly used.
@joverlee521
Copy link
Contributor Author

I plan to merge this tomorrow if there's no other feedback.

Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

CHANGES.md Outdated
@@ -2,6 +2,11 @@

## __NEXT__

### Features

* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little jarring how a +1,921 -4 diff can be reduced to a single sentence! :-)

It wouldn't be unwarranted to say a bit more about augur curate here (goals, planned functionality, reason for existing, etc.) if you wanted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: add a file docs/usage/cli/curate.rst and link to the generated doc page from here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both! I add docs and a bit more to the changelog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the usage docs for augur curate with subdirectory that contains a separate page for each sub-subcommand since we are expecting to add more commands (see preview).

I also added more to the changelog to link to the new usage docs and also included more features that are all packaged into this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I didn't add API docs to avoid conflicts with #1087)

CHANGES.md Outdated
@@ -2,6 +2,11 @@

## __NEXT__

### Features

* Add the curate subcommand with two sub-subcommands, passthru and normalize-strings [#1039][] (@joverlee521)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: add a file docs/usage/cli/curate.rst and link to the generated doc page from here.

Add usage docs for the new augur curate subcommand as a subdirectory
to separate the docs for each sub-subcommand because we are expecting
to add more in the future. The I/O options are repeated for each
sub-subcommand because they use a shared parser. I think of this as
feature to show how each sub-subcommand can be a single command.

Once we add more sub-subcommands, we can add examples of curate
piplines to the main curate docs page. We can also add examples of
curate interacting with external tools such as tsv-utils or csvtk.
CHANGES.md Outdated Show resolved Hide resolved
joverlee521 added a commit that referenced this pull request Nov 4, 2022
Update the language to reflect that entries in the changelog should
only be end user focused changes. Motived by @tsibley's comment¹ in a
recent PR.

It's nice for external users if our changelog is not bloated by
internal development changes!

¹ #1039 (comment)
@joverlee521 joverlee521 merged commit c067b79 into master Nov 4, 2022
@joverlee521 joverlee521 deleted the augur-curate branch November 4, 2022 22:57
@joverlee521 joverlee521 mentioned this pull request Nov 5, 2022
@j23414 j23414 mentioned this pull request Apr 7, 2023
1 task
j23414 added a commit that referenced this pull request Apr 19, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of
commands (#860), specifically addressing issue (#999),
and is a follow-up to #1039.

As part of the augur curate suite of commands, `augur curate titlecase`
would transform the values of a given metadata field to titlecase. This
is useful for normalizing the values of a string that may contain
inconsistent capitalization such as "North America" and "north america".

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 19, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 19, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 21, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 21, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 21, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 21, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 21, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 25, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Apr 25, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Jul 14, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Jul 24, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
j23414 added a commit that referenced this pull request Jul 28, 2023
Adds a new sub-command `augur curate titlecase` based on the transform-string-fields
script in the monkeypox repo. The `augur curate normalize` sub-command
has already been added based on the same script (#1039).

Overall this is part of filling in the gaps in the augur curate suite of commands (#860),
specifically addressing issue (#999), and is a follow-up to #1039.

`augur curate titlecase` would transform the values of a given metadata field to titlecase.
This is useful for normalizing the values of a string that may contain inconsistent
capitalization such as "North America" and "north america".

This commit also adds a test for the new sub-command and updates the documentation.
For testing an upper case to lower case circumflex'd o character conversion, had to use
the escaped unicode character

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

augur curate normalize-strings
4 participants