Add support for one or more sequence files as input to subcommands #608

Open
huddlej opened this issue Aug 24, 2020 · 1 comment

huddlej commented Aug 24, 2020

Proposed feature

Add support for one or more sequence files as input to subcommands. As we plan to support multiple inputs to --metadata arguments in the future, we should support multiple inputs for the --sequences arguments that often accompany the metadata arguments.

Background

Historically, the parse command has been our entry point to Nextstrain builds because we did all the heavy lifting to merge sequences and metadata in our lab’s sequence database. A more common entry point for external users is a curated data set (e.g., metadata and sequences from GISAID) and one or more of their own sets of metadata and sequences.

Most Nextstrain workflows assume that all metadata and sequences have been curated before the workflow starts, so that there is only one metadata file and one sequences file.

Internally, when we need to merge two or more FASTA files of sequences, we tend to concatenate them manually with the UNIX cat command. However, cat fails when a file is missing its trailing newline, which some (typically non-Unix) editors and other programs omit: the last sequence line of one file runs into the header line of the next. Supporting multiple inputs neatly sidesteps this issue for sequences.
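To make the cat failure mode concrete, here is a minimal sketch in plain Python. The helper names (naive_concat, safe_concat, headers) are hypothetical and are not part of augur; the FASTA contents are made-up examples.

```python
def naive_concat(chunks):
    """Mimic `cat`: join file contents exactly as they are."""
    return "".join(chunks)


def safe_concat(chunks):
    """Join file contents, adding a newline after any non-empty chunk that lacks one."""
    return "".join(
        chunk if chunk.endswith("\n") or not chunk else chunk + "\n"
        for chunk in chunks
    )


def headers(fasta_text):
    """Return the FASTA header lines (those starting with '>')."""
    return [line for line in fasta_text.splitlines() if line.startswith(">")]


# Hypothetical file contents: the first file has no trailing newline.
first = ">seq1\nACGT"
second = ">seq2\nTTGA\n"

naive = naive_concat([first, second])
safe = safe_concat([first, second])

# In `naive`, the second header is glued onto the first sequence line
# ("ACGT>seq2"), so only one record header remains.
print(headers(naive))
print(headers(safe))
```

Reading each file separately, as proposed here, avoids the problem without requiring users to repair their files first.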

Support for multiple sequence files already exists in the augur align command, so this functionality has some precedent.

Possible solutions

Internally, we would need to support reading sequences from multiple files into the same standard data structure. We might implement this with a read_sequences function that behaves similarly to the load_alignments function.
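As a rough illustration of reading several files into one data structure, the sketch below uses only the standard library. The names read_fasta and read_sequences_sketch are hypothetical, and this is not augur's implementation. It also assumes that duplicate names are resolved by keeping the first record seen, which is one possible policy among several.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from one FASTA file.

    The name is the first whitespace-delimited token of the header line.
    """
    name, parts = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(parts)
                fields = line[1:].split()
                name, parts = (fields[0] if fields else ""), []
            else:
                parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def read_sequences_sketch(paths):
    """Load records from one or more FASTA files into a single list.

    Assumption: when a name appears more than once, the first record wins.
    """
    seen = {}
    for path in paths:
        for name, sequence in read_fasta(path):
            seen.setdefault(name, sequence)
    return list(seen.items())
```

Because each file is parsed on its own, a missing trailing newline in one file cannot corrupt the first record of the next.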

To address the external interface on the command line, one solution would be to identify all augur commands that currently support unaligned sequences as input with the --sequences argument and add support for multiple arguments to the command line interface.
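On the command line side, accepting several files for one option is typically a small change with argparse. The sketch below is a standalone example, not augur's actual parser; the program name, help text, and file names are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a toy parser whose --sequences option accepts one or more files."""
    parser = argparse.ArgumentParser(prog="example-subcommand")
    parser.add_argument(
        "--sequences",
        nargs="+",  # one or more values, collected into a list
        required=True,
        metavar="FASTA",
        help="one or more FASTA files of sequences",
    )
    return parser


# Hypothetical invocation with two input files.
args = build_parser().parse_args(["--sequences", "gisaid.fasta", "internal.fasta"])
print(args.sequences)
```

With nargs="+", a single file still arrives as a one-element list, so downstream code can always iterate over args.sequences without special-casing the single-input form.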

Another solution would be to encourage users to merge their metadata and sequences as early as possible in their workflow, to avoid multiple sequence inputs and merges downstream. For example, we could add a merge subcommand that knows how to safely merge sequences and metadata into our standard format:

augur merge \
    --sequences gisaid.fasta high_quality_sequences.fasta low_quality_sequences.fasta \
    --metadata gisaid.tsv internal_sequences.csv \
    --output-sequences sequences.fasta \
    --output-metadata metadata.tsv

This command would be the entry point for most external users and produce the same standard outputs we expect from augur parse. If we use this approach, we should focus on a minimal set of functionality to merge data without trying to address all possible data sanitization issues that exist in the world.

Related issues

This issue is related to the issue of supporting multiple metadata inputs through the augur API and, eventually, the command line.

huddlej commented Aug 26, 2020

I forgot that we already implemented a read_sequences function that takes one or more filenames and loads all distinct sequences into a list. Supporting multiple sequence file inputs would then be a matter of calling this function from the augur subcommands where we wish to support multiple inputs.
