Omit duplicate sequences in filter output #898

huddlej · 2022-04-21T19:10:29Z

Description of proposed changes

Instead of dutifully writing all sequences for strains that pass given filters, only write out the first sequence for each strain from the given input sequences and print a warning to stderr when duplicates are detected. This change allows Augur and the workflows that depend on it to be more robust to upstream data quality issues.

Related issue(s)

Related to nextstrain/ncov#926 and nextstrain/ncov-ingest#281

Testing

Adds a functional test to confirm that duplicate sequences are not written
Tested by CI

codecov · 2022-04-21T19:16:21Z

Codecov Report

Merging #898 (fdac030) into master (002c44f) will increase coverage by 0.02%.
The diff coverage is 44.44%.

@@            Coverage Diff             @@
##           master     #898      +/-   ##
==========================================
+ Coverage   34.62%   34.65%   +0.02%     
==========================================
  Files          42       42              
  Lines        6007     6014       +7     
  Branches     1538     1540       +2     
==========================================
+ Hits         2080     2084       +4     
- Misses       3854     3855       +1     
- Partials       73       75       +2

Impacted Files	Coverage Δ
augur/index.py	`63.29% <40.00%> (-1.58%)`	⬇️
augur/filter.py	`70.48% <50.00%> (+0.11%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 002c44f...fdac030. Read the comment docs.

joverlee521

Glad to see this change was so straightforward!

Playing devil's advocate: is there any reason a user may want to include duplicates? Should there be a --include-duplicate-sequences option?

augur/filter.py

victorlin

+1 on @joverlee521's comments otherwise this looks good!

Instead of dutifully writing all sequences for strains that pass given filters, only write out the first sequence for each strain from the given input sequences and print a warning to stderr when duplicates are detected. This change allows Augur and the workflows that depend on it to be more robust to upstream data quality issues.

huddlej · 2022-04-22T20:38:20Z

Thanks, @joverlee521 and @victorlin! I actually prefer to default to throwing an error when we detect duplicates, since duplicates for a single strain name are almost always going to reflect bugs in the upstream workflow. The same would be true with multiple records for the same strain in metadata.

When I tested the ncov workflow with duplicate sequences and this Augur PR, I still ran into an error caused by augur index outputting multiple rows for the same strain when duplicate sequences exist. Since augur index runs prior to augur filter, we also need to omit duplicates in the augur index output, to fix the original issue in the ncov workflow.

It's not really augur filter's job (or augur index's) to resolve duplicates (unless you think of resolving duplicates as a kind of filter but the same logic doesn't apply to augur index). Now that I've tried both the workflow- and augur-based approaches to handling these duplicates, I wish we had an augur sanitize command that we could run in the workflow and whose output could be streamed into augur index and filter. I'm leaning back toward the workflow-based solution, for now.

As with augur filter, skip duplicate copies of strains that have already been indexed and warn the user.

victorlin · 2022-04-22T22:40:22Z

@huddlej

I'm guessing there are many things that can fall under the duty of augur sanitize, some which might not be straightforward to define/implement.

Could we start with the basics (just de-duplication) and add features over time? Creating a new command augur sanitize for de-duplication seems like minimal work which I can help implement.

huddlej · 2022-04-22T23:16:26Z

We can definitely build out a new subcommand gradually over time and start with deduplication. This issue is the best place to start, for reference. We got stuck on the name and scope of the proposed new subcommand, but one way we've handled this in the past is to mock up a bunch of example commands of how we'd like a subcommand to work and then write the code to that (TDD with functional tests, sort of).

Do you want to try this mockup for deduplication of sequences (keeping in mind that we also want to support deduplication of metadata as a separate command)?

victorlin · 2022-04-22T23:35:05Z

Oh, I didn't realize naming was the blocker there. Yeah happy to help 😃

joverlee521 requested changes Apr 21, 2022

View reviewed changes

augur/filter.py Outdated Show resolved Hide resolved

victorlin approved these changes Apr 21, 2022

View reviewed changes

huddlej force-pushed the omit-duplicate-sequences branch from 82efc02 to 7554c9d Compare April 22, 2022 19:11

Omit duplicate sequences from sequence index

fdac030

As with augur filter, skip duplicate copies of strains that have already been indexed and warn the user.

huddlej force-pushed the omit-duplicate-sequences branch from 7554c9d to fdac030 Compare April 22, 2022 21:18

huddlej added the needs triage Needs triage by a Nextstrain team member label Apr 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Omit duplicate sequences in filter output #898

Omit duplicate sequences in filter output #898

huddlej commented Apr 21, 2022 •

edited

codecov bot commented Apr 21, 2022 •

edited

joverlee521 left a comment

victorlin left a comment

huddlej commented Apr 22, 2022 •

edited

victorlin commented Apr 22, 2022 •

edited

huddlej commented Apr 22, 2022

victorlin commented Apr 22, 2022

Omit duplicate sequences in filter output #898

Are you sure you want to change the base?

Omit duplicate sequences in filter output #898

Conversation

huddlej commented Apr 21, 2022 • edited

Description of proposed changes

Related issue(s)

Testing

codecov bot commented Apr 21, 2022 • edited

Codecov Report

joverlee521 left a comment

Choose a reason for hiding this comment

victorlin left a comment

Choose a reason for hiding this comment

huddlej commented Apr 22, 2022 • edited

victorlin commented Apr 22, 2022 • edited

huddlej commented Apr 22, 2022

victorlin commented Apr 22, 2022

huddlej commented Apr 21, 2022 •

edited

codecov bot commented Apr 21, 2022 •

edited

huddlej commented Apr 22, 2022 •

edited

victorlin commented Apr 22, 2022 •

edited