Download subset of genome sequences for selected tree nodes #1110

Open
biocyberman opened this issue May 6, 2020 · 3 comments · May be fixed by #1149
Labels
enhancement New feature or request

Comments

@biocyberman

Context
Similar to downloading a subset of metadata, we want to extract a subset of genome sequences for further analysis. This might not be helpful or allowed in the global nextstrain/auspice instance, but for our local one it is a legal and useful feature to have.

Description
This feature should work almost exactly like extracting a subset of metadata.

Possible solution

  • augur export needs to export and include genome sequences at the user's choice (e.g. via an --export-genomes flag).
  • Genomes are saved in the same directory as auspice's datasetDir.
  • If auspice can find files ending with -sequences.json, it presents a "download subset of sequences" button in the Download Data popup window.
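
The detection step in the last bullet could look roughly like the following. This is only a sketch in Python for brevity (auspice itself is JavaScript); the function name `find_sequence_files` is hypothetical, and the `-sequences.json` suffix is simply the convention proposed above, not something auspice currently recognizes.

```python
from pathlib import Path


def find_sequence_files(dataset_dir, suffix="-sequences.json"):
    """Return the sorted names of regular files in dataset_dir ending with suffix.

    Hypothetical helper: the suffix is the naming convention proposed in this
    issue, not an existing auspice or augur convention.
    """
    return sorted(
        path.name
        for path in Path(dataset_dir).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )
```

A server could call this once at startup and only expose the download button when the returned list is non-empty.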

An observation: auspice removed handling of sequences.JSON at version 1.8.0. This feature is probably related to that sequence-handling code.

@biocyberman biocyberman added the enhancement New feature or request label May 6, 2020
@jameshadfield
Member

jameshadfield commented May 6, 2020

This feature could be made part of auspice and made an "opt-out" extension (or opt-in) so that different implementations can choose whether or not to expose it. I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Just thinking about it briefly, it would involve a new API call to fetch the sequences, subset them, and download them. Or you could post the subsetted strain list and ask for a matching sequences file from the server. There would be memory/speed considerations here as sequence data can be very large, comes in different formats (VCF, fasta) etcetera. I don't think making a new JSON sequence format would be recommended.

Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.
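
For anyone wanting the short-term route, a script along these lines might do. It is a minimal sketch, not part of auspice or augur: the function names are hypothetical, and the metadata column holding the strain name (`strain` here) is an assumption that should be checked against the header of the downloaded TSV.

```python
import csv
import io


def read_strain_names(tsv_text, column="strain"):
    """Return the set of strain names found in the given column of a TSV.

    The column name is an assumption; check the header of your own file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[column] for row in reader if row.get(column)}


def filter_fasta(fasta_lines, wanted):
    """Yield the lines of FASTA records whose name is in `wanted`.

    Works line by line, so a large FASTA file never has to fit in memory.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in wanted
        if keep:
            yield line
```

Passing an open file handle as `fasta_lines` and writing the yielded lines to a new file gives the subsetted FASTA.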

> auspice removed handling of sequences.JSON at version 1.8.0

We used to rely on this to extract mutations to display genotypes as I remember (it was >2 years ago). It's on the horizon for us to implement fetching of one (ancestral) sequence which we need to colour the tree by a position which has no observed mutations. It will probably be in fasta format, but the details haven't been worked out. But this is separate from what's being asked in this issue.

@biocyberman
Author

Thanks for the comments.

> I know others have asked for it, so I think it would get used. Happy for someone to implement it, but it's not something we (nextstrain.org) can pursue currently.

Sounds like something worth pursuing. I can probably arrange some time to do this, depending on task priority in the COVID-19 project I am working on.

> Currently one can download a metadata TSV subsetted appropriately, which you could then use to get the sequences you want via a script (or a different web API etc). I appreciate that it may be nicer to do it all within auspice, but there may be easier short-term solutions.

I already wrote a short bash script to do the extraction, but I agree: the auspice interface is more interactive and less intimidating.

@biocyberman
Author

biocyberman commented May 9, 2020

@jameshadfield I made some progress hacking on the feature. I need your comments and guidance:

  1. Since nt_muts.json from augur already contains sequences, I am thinking of copying it into datasetDir. In the strainGenome function, I want to open a stream, filter the JSON file by strain names, and return a FASTA file. The problem is that I don't know how, or whether it is a good idea, to expose datasetDir there. If not, where would be a better place to parse nt_muts.json at run time? To improve efficiency, maybe a lightweight database like tingodb would be better? For what it's worth, nt_muts.json can also be simplified to contain only names and sequences.

  2. Alternatively, given a list of strain names, the function can spawn a shell process to run a seqtk command, similar to what I did in the shell script, and return the file to the client. The same question applies to accessing the genomes.fasta file.
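
The filtering logic in option 1 can be sketched as follows. This is written in Python for brevity and would need porting to the auspice server's JavaScript; the function names are hypothetical, and the layout `{"nodes": {<name>: {"sequence": ...}}}` is an assumption about nt_muts.json that should be verified against an actual file. Unlike the streaming approach described above, this version loads the whole JSON into memory.

```python
import json


def load_node_data(path):
    """Parse a node-data JSON file (for example nt_muts.json) into a dict."""
    with open(path) as handle:
        return json.load(handle)


def node_data_to_fasta(node_data, strains, width=80):
    """Build FASTA text for the requested strains.

    Assumes node_data["nodes"][name]["sequence"] holds the full sequence;
    strains that are missing or have no stored sequence are skipped.
    """
    nodes = node_data.get("nodes", {})
    records = []
    for name in strains:
        sequence = nodes.get(name, {}).get("sequence")
        if not sequence:
            continue
        # Wrap the sequence at a fixed line width, as is conventional for FASTA.
        wrapped = "\n".join(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
        records.append(f">{name}\n{wrapped}\n")
    return "".join(records)
```

Because the result is built as one string, this suits modest datasets; for large ones a streaming JSON parser or the seqtk route in option 2 would likely scale better.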

@biocyberman biocyberman linked a pull request May 30, 2020 that will close this issue