Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Omit duplicate sequences in filter output #898

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented Apr 21, 2022

Description of proposed changes

Instead of dutifully writing all sequences for strains that pass given filters, only write out the first sequence for each strain from the given input sequences and print a warning to stderr when duplicates are detected. This change allows Augur and the workflows that depend on it to be more robust to upstream data quality issues.

Related issue(s)

Related to nextstrain/ncov#926 and nextstrain/ncov-ingest#281

Testing

  • Adds a functional test to confirm that duplicate sequences are not written
  • Tested by CI

@codecov
Copy link

codecov bot commented Apr 21, 2022

Codecov Report

Merging #898 (fdac030) into master (002c44f) will increase coverage by 0.02%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##           master     #898      +/-   ##
==========================================
+ Coverage   34.62%   34.65%   +0.02%     
==========================================
  Files          42       42              
  Lines        6007     6014       +7     
  Branches     1538     1540       +2     
==========================================
+ Hits         2080     2084       +4     
- Misses       3854     3855       +1     
- Partials       73       75       +2     
Impacted Files Coverage Δ
augur/index.py 63.29% <40.00%> (-1.58%) ⬇️
augur/filter.py 70.48% <50.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 002c44f...fdac030. Read the comment docs.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad to see this change was so straightforward!

Playing devil's advocate: is there any reason a user may want to include duplicates? Should there be a --include-duplicate-sequences option?

augur/filter.py Outdated Show resolved Hide resolved
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on @joverlee521's comments otherwise this looks good!

Instead of dutifully writing all sequences for strains that pass given
filters, only write out the first sequence for each strain from the
given input sequences and print a warning to stderr when duplicates are
detected. This change allows Augur and the workflows that depend on it
to be more robust to upstream data quality issues.
@huddlej
Copy link
Contributor Author

huddlej commented Apr 22, 2022

Thanks, @joverlee521 and @victorlin! I actually prefer to default to throwing an error when we detect duplicates, since duplicates for a single strain name are almost always going to reflect bugs in the upstream workflow. The same would be true with multiple records for the same strain in metadata.

When I tested the ncov workflow with duplicate sequences and this Augur PR, I still ran into an error caused by augur index outputting multiple rows for the same strain when duplicate sequences exist. Since augur index runs prior to augur filter, we also need to omit duplicates in the augur index output, to fix the original issue in the ncov workflow.

It's not really augur filter's job (or augur index's) to resolve duplicates (unless you think of resolving duplicates as a kind of filter but the same logic doesn't apply to augur index). Now that I've tried both the workflow- and augur-based approaches to handling these duplicates, I wish we had an augur sanitize command that we could run in the workflow and whose output could be streamed into augur index and filter. I'm leaning back toward the workflow-based solution, for now.

As with augur filter, skip duplicate copies of strains that have already
been indexed and warn the user.
@victorlin
Copy link
Member

victorlin commented Apr 22, 2022

@huddlej

I'm guessing there are many things that can fall under the duty of augur sanitize, some which might not be straightforward to define/implement.

Could we start with the basics (just de-duplication) and add features over time? Creating a new command augur sanitize for de-duplication seems like minimal work which I can help implement.

@huddlej
Copy link
Contributor Author

huddlej commented Apr 22, 2022

We can definitely build out a new subcommand gradually over time and start with deduplication. This issue is the best place to start, for reference. We got stuck on the name and scope of the proposed new subcommand, but one way we've handled this in the past is to mock up a bunch of example commands of how we'd like a subcommand to work and then write the code to that (TDD with functional tests, sort of).

Do you want to try this mockup for deduplication of sequences (keeping in mind that we also want to support deduplication of metadata as a separate command)?

@victorlin
Copy link
Member

Oh, I didn't realize naming was the blocker there. Yeah happy to help 😃

@huddlej huddlej added the needs triage Needs triage by a Nextstrain team member label Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs triage Needs triage by a Nextstrain team member
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

None yet

3 participants