Provide an interface to resolve duplicate strain names in metadata #725

Open
huddlej opened this issue May 20, 2021 · 0 comments
Labels
enhancement New feature or request


Context
Historically, metadata for Augur inputs were curated to exclude duplicate strains (e.g., by resolving duplicates in fauna during download from the database) and, as such, read_metadata was designed to throw an error when it found duplicate strains. However, GenBank has supported multiple versions of the same strain's sequence for a long time and GISAID recently added similar support. This means a valid metadata download from either of these sources can produce an error when Augur tries to read these data.

Description
Augur should provide an interface to resolve duplicate strain names in the metadata instead of throwing an error. We should retain the option to throw an error on duplicates, but we should also consider making duplicate resolution the default behavior.

Examples

To reproduce the current behavior, create some minimal metadata with a duplicated strain:

cut -f 1,5 data/example_metadata.tsv | head -n 4 | sed 's/VIC1008/VIC1000/' > duplicate_metadata.tsv

Then, try to load the data from a Python terminal:

>>> from augur.utils import read_metadata
>>> read_metadata("duplicate_metadata.tsv")
Traceback (most recent call last):
  File "<ipython-input-5-a6d05e39306b>", line 1, in <module>
    read_metadata("duplicate_metadata.tsv")
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 74, in read_metadata
    return MetadataFile(fname, query).read()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 21, in read
    self.check_metadata_duplicates()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 63, in check_metadata_duplicates
    raise ValueError(
ValueError: Duplicated strain in metadata: Australia/VIC1000/2020

Possible solution

Ideally, the default behavior should require no action from the user: metadata with duplicate strains should have those duplicates resolved automatically.

To resolve duplicate GISAID or GenBank records, we want to prefer the record with the most recent database accession/id. We currently annotate accessions for GISAID and GenBank in our ncov workflow as gisaid_epi_isl and genbank_accession, respectively. One possible solution could then be:

  1. Check for duplicates.
  2. If there are no duplicates, continue.
  3. If there are duplicates, check an Augur config variable (set either in a global environment variable or in a config file) that controls whether duplicates should raise an error, and throw an error if so configured.
  4. If there are duplicates and no error is to be thrown, check for one of our predefined accession columns (gisaid_epi_isl and genbank_accession to start).
  5. If an accession column exists, sort records by strain and accession in ascending order and take the last record for each strain (or, equivalently, sort in descending order and take the first).
  6. If no accession column exists, sort records by strain and take the first record for each strain.
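
The steps above could be sketched roughly as follows. This is a minimal illustration, not Augur's implementation: the function names `resolve_duplicates` and `accession_key`, the `AUGUR_ERROR_ON_DUPLICATES` environment variable, and the choice to compare accessions by their numeric parts are all assumptions made for the example. Only the column names `gisaid_epi_isl` and `genbank_accession` and the error message come from this issue. The sketch works on a list of dict records and keeps strains in file order rather than sorting by strain.

```python
import os
import re

# Accession columns named in the issue; checked in this order.
ACCESSION_COLUMNS = ("gisaid_epi_isl", "genbank_accession")


def accession_key(value):
    """Hypothetical sort key: compare accessions by their numeric parts,
    so that "EPI_ISL_100" ranks after "EPI_ISL_99"."""
    return [int(part) for part in re.findall(r"\d+", value or "")]


def resolve_duplicates(records, strain_column="strain", error_on_duplicates=None):
    """Return one record per strain from a list of dict records."""
    if error_on_duplicates is None:
        # Hypothetical environment variable; Augur does not define it.
        error_on_duplicates = os.environ.get("AUGUR_ERROR_ON_DUPLICATES", "0") == "1"

    # Steps 1-2: group records by strain, preserving file order.
    by_strain = {}
    for record in records:
        by_strain.setdefault(record[strain_column], []).append(record)

    duplicated = [s for s, group in by_strain.items() if len(group) > 1]
    if not duplicated:
        return list(records)

    # Step 3: raise an error if configured to do so.
    if error_on_duplicates:
        raise ValueError(f"Duplicated strain in metadata: {duplicated[0]}")

    # Step 4: use the first predefined accession column present in all records.
    accession_column = next(
        (c for c in ACCESSION_COLUMNS if all(c in r for r in records)), None
    )

    resolved = []
    for group in by_strain.values():
        if accession_column is not None:
            # Step 5: keep the record with the highest accession.
            resolved.append(max(group, key=lambda r: accession_key(r[accession_column])))
        else:
            # Step 6: no accession column, so keep the first record seen.
            resolved.append(group[0])
    return resolved
```

Comparing accessions by numeric parts is one way to avoid the lexicographic pitfall where `"EPI_ISL_99"` sorts after `"EPI_ISL_100"`; whether a higher number reliably means a more recent record is an assumption that would need checking against each database.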