-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Omit duplicate sequences in filter output #898
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #898 +/- ##
==========================================
+ Coverage 34.62% 34.65% +0.02%
==========================================
Files 42 42
Lines 6007 6014 +7
Branches 1538 1540 +2
==========================================
+ Hits 2080 2084 +4
- Misses 3854 3855 +1
- Partials 73 75 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad to see this change was so straightforward!
Playing devil's advocate: is there any reason a user may want to include duplicates? Should there be a --include-duplicate-sequences
option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on @joverlee521's comments otherwise this looks good!
Instead of dutifully writing all sequences for strains that pass given filters, only write out the first sequence for each strain from the given input sequences and print a warning to stderr when duplicates are detected. This change allows Augur and the workflows that depend on it to be more robust to upstream data quality issues.
82efc02
to
7554c9d
Compare
Thanks, @joverlee521 and @victorlin! I actually prefer to default to throwing an error when we detect duplicates, since duplicates for a single strain name are almost always going to reflect bugs in the upstream workflow. The same would be true with multiple records for the same strain in metadata. When I tested the ncov workflow with duplicate sequences and this Augur PR, I still ran into an error caused by augur index outputting multiple rows for the same strain when duplicate sequences exist. Since augur index runs prior to augur filter, we also need to omit duplicates in the augur index output, to fix the original issue in the ncov workflow. It's not really augur filter's job (or augur index's) to resolve duplicates (unless you think of resolving duplicates as a kind of filter but the same logic doesn't apply to augur index). Now that I've tried both the workflow- and augur-based approaches to handling these duplicates, I wish we had an |
As with augur filter, skip duplicate copies of strains that have already been indexed and warn the user.
7554c9d
to
fdac030
Compare
I'm guessing there are many things that can fall under the duty of Could we start with the basics (just de-duplication) and add features over time? Creating a new command |
We can definitely build out a new subcommand gradually over time and start with deduplication. This issue is the best place to start, for reference. We got stuck on the name and scope of the proposed new subcommand, but one way we've handled this in the past is to mock up a bunch of example commands of how we'd like a subcommand to work and then write the code to that (TDD with functional tests, sort of). Do you want to try this mockup for deduplication of sequences (keeping in mind that we also want to support deduplication of metadata as a separate command)? |
Oh, I didn't realize naming was the blocker there. Yeah happy to help 😃 |
Description of proposed changes
Instead of dutifully writing all sequences for strains that pass given filters, only write out the first sequence for each strain from the given input sequences and print a warning to stderr when duplicates are detected. This change allows Augur and the workflows that depend on it to be more robust to upstream data quality issues.
Related issue(s)
Related to nextstrain/ncov#926 and nextstrain/ncov-ingest#281
Testing