Provide an interface to resolve duplicate strain names in metadata #725

huddlej · 2021-05-20T16:03:55Z

Context
Historically, metadata for Augur inputs were curated to exclude duplicate strains (e.g., by resolving duplicates in fauna during download from the database) and, as such, read_metadata was designed to throw an error when it found duplicate strains. However, GenBank has supported multiple versions of the same strain's sequence for a long time and GISAID recently added similar support. This means a valid metadata download from either of these sources can produce an error when Augur tries to read these data.

Description
Augur should provide an interface to resolve duplicate strain names in the metadata instead of throwing an error. We should retain the option to throw an error on duplicates, but we should also consider making duplicate resolution the default behavior.

Examples

To test the current issue, create some minimal metadata with a duplicated strain:

cut -f 1,5 data/example_metadata.tsv | head -n 4 | sed 's/VIC1008/VIC1000/' > duplicate_metadata.tsv

Then, try to load the data from a Python terminal:

>>> from augur.utils import read_metadata
>>> read_metadata("duplicate_metadata.tsv")
Traceback (most recent call last):
  File "<ipython-input-5-a6d05e39306b>", line 1, in <module>
    read_metadata("duplicate_metadata.tsv")
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 74, in read_metadata
    return MetadataFile(fname, query).read()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 21, in read
    self.check_metadata_duplicates()
  File "/Users/jlhudd/miniconda3/envs/nextstrain/lib/python3.8/site-packages/augur/util_support/metadata_file.py", line 63, in check_metadata_duplicates
    raise ValueError(
ValueError: Duplicated strain in metadata: Australia/VIC1000/2020
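
The error reports only one duplicated name, so it can help to list every duplicated strain before deciding how to resolve them. The following is a minimal sketch using only the standard library, not part of Augur; `find_duplicate_strains` is a hypothetical helper, and it assumes the metadata are tab-delimited with a `strain` column.

```python
import csv
import io
from collections import Counter


def find_duplicate_strains(handle, strain_column="strain"):
    """Return strain names that appear more than once in a tab-delimited file.

    Hypothetical helper for illustration; not an Augur function.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[strain_column] for row in reader)
    return sorted(name for name, count in counts.items() if count > 1)


# In-memory stand-in for duplicate_metadata.tsv; the values are made up.
sample = io.StringIO(
    "strain\tdate\n"
    "A/1\t2020-01-01\n"
    "A/1\t2020-01-02\n"
    "B/2\t2020-01-03\n"
)
print(find_duplicate_strains(sample))
```

For a file on disk, the same function should work with `open("duplicate_metadata.tsv", newline="")` in place of the in-memory sample.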

Possible solution

Ideally, the solution to this issue will not require the user to do anything by default; the solution should allow users with duplicate strains to resolve these duplicates automatically.

To resolve duplicate GISAID or GenBank records, we want to prefer the record with the most recent database accession/id. We currently annotate accessions for GISAID and GenBank in our ncov workflow as gisaid_epi_isl and genbank_accession, respectively. One possible solution could then be:

1. Check for duplicates.
2. If there are no duplicates, continue.
3. If there are duplicates, check an Augur config variable (either a global environment variable or a config file entry) for whether we should throw an error on duplicates, and throw the error if so configured.
4. If there are duplicates and no error is to be thrown, check for one of our predefined accession columns (gisaid_epi_isl and genbank_accession to start).
   - If an accession column exists, sort records by strain and accession in ascending order and take the last record for each strain (equivalently, sort in descending order and take the first).
   - If an accession column does not exist, sort records by strain and take the first record for each strain.
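
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not Augur's implementation: `resolve_duplicates` is a hypothetical helper, `AUGUR_ERROR_ON_DUPLICATES` is a made-up environment variable name, and records are assumed to be dictionaries keyed by column name (as `csv.DictReader` would produce).

```python
import os

# Accession columns named in the issue, checked in this order.
ACCESSION_COLUMNS = ("gisaid_epi_isl", "genbank_accession")


def resolve_duplicates(records, strain_column="strain", error_on_duplicates=None):
    """Return one record per strain from a list of dicts (hypothetical helper)."""
    if error_on_duplicates is None:
        # AUGUR_ERROR_ON_DUPLICATES is a made-up variable name for this sketch.
        error_on_duplicates = os.environ.get("AUGUR_ERROR_ON_DUPLICATES") == "1"

    # Group records by strain, preserving the input order within each group.
    by_strain = {}
    for record in records:
        by_strain.setdefault(record[strain_column], []).append(record)

    duplicated = sorted(name for name, group in by_strain.items() if len(group) > 1)
    if not duplicated:
        return list(records)
    if error_on_duplicates:
        raise ValueError(f"Duplicated strain in metadata: {', '.join(duplicated)}")

    # Use the first predefined accession column present in the data, if any.
    columns = set().union(*(record.keys() for record in records))
    accession = next((c for c in ACCESSION_COLUMNS if c in columns), None)

    resolved = []
    for name in sorted(by_strain):
        group = by_strain[name]
        if accession is None:
            # No accession column: keep the first record for the strain.
            resolved.append(group[0])
        else:
            # Plain string comparison; records missing an accession sort first.
            resolved.append(max(group, key=lambda r: r.get(accession) or ""))
    return resolved


# Example with made-up accessions: the record with the larger accession wins.
example = [
    {"strain": "A/1", "gisaid_epi_isl": "EPI_ISL_000001"},
    {"strain": "A/1", "gisaid_epi_isl": "EPI_ISL_000002"},
    {"strain": "B/2", "gisaid_epi_isl": "EPI_ISL_000003"},
]
print(resolve_duplicates(example, error_on_duplicates=False))
```

One caveat with this approach: accessions are compared as plain strings, so identifiers with different numbers of digits may not sort in the order they were issued. A real implementation would likely need to compare the numeric part of the accession, or the version suffix for GenBank records.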

huddlej added the enhancement New feature or request label May 20, 2021

huddlej mentioned this issue May 20, 2021

Support default GISAID metadata and sequences nextstrain/ncov#640
