Pangenomics using transcriptomes as "genomic units" needs a bit of adjustments... #839

Closed
tdelmont opened this issue Jun 2, 2018 · 1 comment

Comments

@tdelmont
Contributor

tdelmont commented Jun 2, 2018

Hi there,
This is a general comment on potential future developments, not tied to a specific anvi'o version or bug.

I recently realized that the pangenomic workflow of anvi'o could be improved at 3 levels in order to include (meta)transcriptomes as "genomic units". This is particularly useful in cases where a reference genome is not available, but transcriptomic data is.

Here are the 3 bottlenecks I identified so far:

(1) each contig is in theory a "gene fragment" for transcriptomic data, so it would be very useful to offer a special flag during anvi-gen-contigs-database so that anvi'o knows each contig should be identified as a single gene covering the entire sequence.

(2) this leads to another problem: we do not know the direction of the gene... Would it be possible to use a letter "x" (or something else) instead of "r" or "f", so that anvi'o knows we do not know the direction or even the frame of the gene? I realize this might add a lot of complications, but it is key for the third bottleneck.

(3) could it be possible to allow a flag for a blastx (instead of blastp) when computing the gene clusters. This way, all frames of each gene would be computed to find best matches. I might have missed something obvious, but as far as I can see, this could allow the making of relevant gene clusters compatible with both genomic and transcriptomic data...

I cannot share too many details, but I have interesting research avenues that could be explored, contingent upon a few improvements to increase the flexibility of the anvi'o pangenomic workflow.

Does that make sense, and is this of interest to some of the anvi'o developers?

Thanks for reflecting on this request,

Tom

@meren
Member

meren commented Jun 2, 2018

Hey Tom,

> (1) each contig is in theory a "gene fragment" for transcriptomic data, so it would be very useful to offer a special flag during anvi-gen-contigs-database so that anvi'o knows each contig should be identified as a single gene covering the entire sequence.

This is possible, and I can see how it could be useful. I did run into similar situations and ended up generating external gene calls files with partial gene calls that covered the entire contig. So it doesn't need a change in anvi'o in theory, but I agree that it would make things much easier.
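A minimal sketch of that workaround, for illustration only: it reads a FASTA of transcripts and writes one partial gene call per contig, spanning the whole sequence. The column names follow the external gene calls layout as I understand it from the anvi'o documentation (older versions may not expect `call_type`), and the 0-based start with exclusive stop is likewise an assumption to check against your anvi'o version. The function names, the `source`/`version` values, and the placeholder direction `f` are all hypothetical.

```python
import csv
import io


def read_fasta_lengths(handle):
    """Yield (contig_name, sequence_length) for each record in a FASTA handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the first word of the defline as the contig name.
            name, length = line[1:].split()[0], 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def write_whole_contig_gene_calls(fasta_handle, out_handle,
                                  source="transcript", version="v1"):
    """Write one partial gene call per contig, covering the entire contig."""
    # Assumed column layout; verify against the anvi'o version in use.
    columns = ["gene_callers_id", "contig", "start", "stop", "direction",
               "partial", "call_type", "source", "version"]
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for gene_id, (contig, length) in enumerate(read_fasta_lengths(fasta_handle)):
        # start=0 and stop=length (assumed 0-based, stop exclusive);
        # direction "f" is only a placeholder since the true strand is unknown;
        # partial=1 flags the call as incomplete; call_type=1 assumed to mean coding.
        writer.writerow([gene_id, contig, 0, length, "f", 1, 1, source, version])


if __name__ == "__main__":
    demo = io.StringIO(">transcript_1\nATGAAACCC\n>transcript_2\nGGGTTT\n")
    out = io.StringIO()
    write_whole_contig_gene_calls(demo, out)
    print(out.getvalue())
```

A file like this could then be handed to `anvi-gen-contigs-database` through its `--external-gene-calls` option. Note that it does not solve the unknown-direction problem raised in point (2); every call is simply labeled forward.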

> (2) Would it be possible to use a letter "x" (or something else) instead of "r" or "f", so that anvi'o knows we do not know the direction or even the frame of the gene? I realize this might add a lot of complications, but it is key for the third bottleneck.

This is doable, but it would take a lot of time and energy we currently can't afford :(

> (3) could it be possible to allow a flag for a blastx (instead of blastp) when computing the gene clusters. This way, all frames of each gene would be computed to find best matches.

This is also doable, and in fact it does not depend on the first one, although it would be very computationally demanding. We can't blastx only some of the data: the current strategy would only allow us to do reciprocal blastp or reciprocal blastx over the entire genomes storage. If you have contigs databases that are filled with partial gene calls with unknown directions that can't be turned into amino acid sequences reliably, you could elect to use blastx, and wait four years, but I don't see why things wouldn't run smoothly :)
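To make the cost concrete: blastx translates every nucleotide query in all six reading frames before searching, so each sequence yields six candidate proteins instead of one. The sketch below illustrates that six-frame translation in plain Python using the standard genetic code. It is only an illustration of the idea, not how BLAST or anvi'o implements it, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGAAATAG").items():
        print(frame, protein)
```

With transcript fragments of unknown strand and frame, any of the six translations could be the real one, which is why the search cannot simply fall back to a single amino acid sequence per gene call.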

We should think about that.

In an ideal world, anvi'o would have a comparative genomics person dedicated to improving these aspects of the platform. We'll see.


2 participants