Download subset of genome sequences for selected tree nodes #1110

biocyberman · 2020-05-06T21:25:28Z

Context
Similar to downloading a subset of metadata, we want to extract a subset of genome sequences for further analysis. This might not be helpful or allowed in the global nextstrain/auspice instance, but for our local one it is a legal and useful feature to have.

Description
This feature should work almost exactly like extracting a subset of metadata.

Possible solution

augur export needs to export and include genome sequences when the user chooses to (i.e. via an --export-genomes flag).
Genomes are saved in the same directory as auspice's datasetDir.
If auspice finds files ending with -sequences.json, it presents a "download subset of sequences" button in the Download Data popup window.

An observation: auspice removed handling of sequences.JSON at version 1.8.0. This feature is probably related to the code that handled those sequences.

jameshadfield · 2020-05-06T23:28:58Z

This feature could be made part of auspice and made an "opt-out" extension (or opt-in) so that different implementations can choose whether or not to expose it. I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Just thinking about it briefly, it would involve a new API call to fetch the sequences, subset them, and download them. Or you could post the subsetted strain list and ask for a matching sequences file from the server. There would be memory/speed considerations here, as sequence data can be very large and comes in different formats (VCF, FASTA, etc.). I don't think making a new JSON sequence format would be recommended.

Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.
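A minimal sketch of the script route described above, assuming the downloaded metadata TSV has a strain-name column (called "Strain" here, which is an assumption; pass the real column name if it differs) and that each FASTA header starts with that same strain name. The function names are hypothetical and not part of auspice or augur.

```python
import csv
import io
from typing import Iterable, Iterator, Set


def read_strain_names(tsv_text: str, column: str = "Strain") -> Set[str]:
    """Collect strain names from metadata TSV text.

    The default column name is an assumption about the downloaded file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def subset_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose header ID is in `wanted`.

    Input is consumed line by line, so memory use stays flat even for
    large sequence files.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep:
            yield line
```

To use it, read the TSV text into `read_strain_names`, open the full FASTA file, and write the lines yielded by `subset_fasta(fasta_file, wanted)` to a new file. Because the filter is a generator over lines, it addresses the memory concern for FASTA input; VCF input would need a different approach.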

> auspice removed handling of sequences.JSON at version 1.8.0

We used to rely on this to extract mutations to display genotypes as I remember (it was >2 years ago). It's on the horizon for us to implement fetching of one (ancestral) sequence which we need to colour the tree by a position which has no observed mutations. It will probably be in fasta format, but the details haven't been worked out. But this is separate from what's being asked in this issue.

biocyberman · 2020-05-07T07:13:18Z

Thanks for the comments.

> I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Sounds like something worth pursuing. I can probably arrange some time to do this, depending on task priority in the COVID19 project I am working on.

> Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.

I already wrote a short bash script to do the extraction, but I agree, the auspice interface is more interactive and less intimidating.

biocyberman · 2020-05-09T10:18:44Z

@jameshadfield I made some progress hacking on the feature. I need your comments and guidance:

Since nt_muts.json from augur already contains sequences, I am thinking of copying it into datasetDir. In the strainGenome function, I want to open a stream, filter the JSON file by strain names, and return a FASTA file. The problem is that I don't know how to expose datasetDir there, or whether that is a good idea. If not, where would be a better place to parse nt_muts.json at run time? To improve efficiency, maybe a lightweight database like tingodb would be better? For what it's worth, nt_muts.json can also be simplified to contain only names and sequences.
Alternatively, given a list of strain names, the function can spawn a shell process to run a seqtk command, similar to what I did in the shell script, and return the file to the client. A similar question applies to accessing the genomes.fasta file.
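The first option could look roughly like the sketch below, written in Python rather than in auspice's server code. It assumes nt_muts.json has the layout `{"nodes": {<name>: {"sequence": "..."}}}`; that layout, the function names, and the 60-character line width are all assumptions for illustration, not confirmed details of augur or auspice.

```python
import json
from typing import Iterable, Iterator


def load_node_data(path: str) -> dict:
    """Read a node-data JSON file (e.g. nt_muts.json) fully into memory."""
    with open(path) as handle:
        return json.load(handle)


def node_data_to_fasta(node_data: dict, strains: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield one FASTA record per requested strain that has a stored sequence.

    Assumes node_data["nodes"][name]["sequence"] holds the full sequence;
    strains that are missing or have no sequence are skipped.
    """
    nodes = node_data.get("nodes", {})
    for name in strains:
        sequence = nodes.get(name, {}).get("sequence")
        if not sequence:
            continue
        # Plain slicing is used for wrapping so gap characters ('-') are kept as-is.
        chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        yield f">{name}\n" + "\n".join(chunks) + "\n"
```

Note that this parses the whole JSON file per call rather than streaming it, so for large datasets the cost is paid on every request; a pre-simplified file with only names and sequences, or the seqtk route, would reduce that.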

biocyberman added the enhancement New feature or request label May 6, 2020

biocyberman linked a pull request May 30, 2020 that will close this issue

Add opt-in genome download feature #1149

Open
