Add support for one or more sequence files as input to subcommands #608

huddlej · 2020-08-24T22:58:28Z

Proposed feature

Add support for one or more sequence files as input to subcommands. As we plan to support multiple inputs to --metadata arguments in the future, we should support multiple inputs for the --sequences arguments that often accompany the metadata arguments.

Background

Historically, the parse command has been our entry point to Nextstrain builds because we did all the heavy lifting to merge sequences and metadata in our lab’s sequence database. A more common entry point for external users is a curated data set (e.g., metadata and sequences from GISAID) and one or more of their own sets of metadata and sequences.

Most Nextstrain workflows assume that all metadata and sequences have been curated before the workflow starts, so that there is only one metadata file and one sequences file.

Internally, if we need to merge two or more FASTA files of sequences, we tend to concatenate these files manually with the UNIX cat command. However, cat has a failure mode when a file is missing its trailing newline, as happens with files written by some (typically non-Unix) editors and other programs: the last sequence line of that file runs into the first header line of the next file. Supporting multiple inputs neatly side-steps this issue for sequences.
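To make the failure mode concrete, here is a minimal sketch in plain Python. The record names and the join_fasta helper are hypothetical and are not part of augur. It compares byte-wise concatenation, which is what cat does, with a join that adds a newline where one is missing.

```python
# Contents of two hypothetical FASTA files.
# The first one is missing its trailing newline.
first = ">seq1\nACGT"
second = ">seq2\nGGCC\n"

# Byte-wise concatenation, equivalent to `cat first.fasta second.fasta`.
naive = first + second


def join_fasta(chunks):
    """Concatenate FASTA text, appending a newline to any non-empty chunk that lacks one."""
    return "".join(c if c.endswith("\n") or not c else c + "\n" for c in chunks)


def headers(text):
    """Return the record names found on lines that start with '>'."""
    return [line[1:] for line in text.splitlines() if line.startswith(">")]


safe = join_fasta([first, second])

print(headers(naive))  # ['seq1'] -- the seq2 header was glued onto the ACGT line
print(headers(safe))   # ['seq1', 'seq2']
```

In the naive case the second record disappears without any error, and its header text ends up inside the first record's sequence.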

Support for multiple sequence files already exists in the augur align command, so this functionality has some precedent.

Possible solutions

Internally, we would need to support reading sequences from multiple files into the same standard data structure. We might implement this with a read_sequences function that behaves similarly to the load_alignments function.

To address the external interface on the command line, one solution would be to identify all augur commands that currently support unaligned sequences as input with the --sequences argument and add support for multiple arguments to the command line interface.
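As a rough illustration of the command line change, a --sequences argument can accept one or more values through argparse's nargs="+" option. This is only a sketch: the program name, help text, and build_parser function are placeholders, not augur's actual argument definitions.

```python
import argparse


def build_parser():
    # Hypothetical parser; augur registers its subcommand arguments elsewhere.
    parser = argparse.ArgumentParser(prog="example-subcommand")
    parser.add_argument(
        "--sequences",
        nargs="+",          # accept one or more file paths
        required=True,
        metavar="FASTA",
        help="one or more FASTA files of sequences",
    )
    return parser


# A single file still parses, but the attribute is now always a list.
single = build_parser().parse_args(["--sequences", "sequences.fasta"])
multiple = build_parser().parse_args(["--sequences", "a.fasta", "b.fasta"])

print(single.sequences)    # ['sequences.fasta']
print(multiple.sequences)  # ['a.fasta', 'b.fasta']
```

Existing single-file invocations keep working with this approach, but any code that reads args.sequences has to handle a list rather than a string.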

Another solution would be to encourage users to merge their metadata and sequences as early as possible in their workflow, to avoid multiple sequence inputs and merges downstream. For example, we could add a merge subcommand that knows how to safely merge sequences and metadata into our standard format:

augur merge \
    --sequences gisaid.fasta high_quality_sequences.fasta low_quality_sequences.fasta \
    --metadata gisaid.tsv internal_sequences.csv \
    --output-sequences sequences.fasta \
    --output-metadata metadata.tsv

This command would be the entry point for most external users and produce the same standard outputs we expect from augur parse. If we use this approach, we should focus on a minimal set of functionality to merge data without trying to address all possible data sanitization issues that exist in the world.

Related issues

This issue is related to the issue of supporting multiple metadata inputs through the augur API and, eventually, the command line.

huddlej · 2020-08-26T22:07:42Z

I forgot that we already implemented a read_sequences function that takes one or more filenames and loads all distinct sequences into a list. Supporting multiple sequence file inputs would then be a matter of calling this function from the augur subcommands where we wish to support multiple inputs.
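For readers unfamiliar with that function, the behavior described above could look roughly like the following. This is not augur's implementation: the names read_fasta and read_sequences_sketch are made up for illustration, the parser is a simplified plain-Python one, and treating "distinct" as "first record seen for each name" is an assumption.

```python
import os
import tempfile


def read_fasta(path):
    """Yield (name, sequence) pairs from one FASTA file."""
    name, parts = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(parts)
                name, parts = line[1:].strip(), []
            elif name is not None:
                parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def read_sequences_sketch(*paths):
    """Load records from one or more files, keeping the first record per name.

    Assumption: duplicates are detected by record name only.
    """
    seen = set()
    records = []
    for path in paths:
        for name, sequence in read_fasta(path):
            if name in seen:
                continue
            seen.add(name)
            records.append((name, sequence))
    return records


with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.fasta")
    second = os.path.join(tmp, "second.fasta")
    with open(first, "w") as fh:
        fh.write(">seq1\nACGT")  # deliberately no trailing newline
    with open(second, "w") as fh:
        fh.write(">seq2\nGG\nCC\n>seq1\nTTTT\n")
    records = read_sequences_sketch(first, second)

print(records)  # [('seq1', 'ACGT'), ('seq2', 'GGCC')]
```

Because each file is parsed separately, a missing trailing newline in one file cannot corrupt the records of the next.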

huddlej added the enhancement label Aug 24, 2020

victorlin mentioned this issue Aug 30, 2023

WIP: Support multiple inputs during filter #697

Draft