Supporting multiple gene callers #565

Open
meren opened this Issue Jul 24, 2017 · 6 comments

Member

meren commented Jul 24, 2017

Although anvi'o in theory can support multiple gene calls from multiple sources on the same contigs database, in practice there are issues.

This branch is to address those: https://github.com/merenlab/anvio/tree/gene_caller_option

So far it seems it will require a very significant shuffling in the dbops module, but we will see.

We will not address this immediately, but should keep it in mind for a better design in the future.

Contributor

AstrobioMike commented Aug 12, 2017

Hi there,

When scanning for HMMs, I got this fun warning:

93 entries in the sequences table had blank sequences :/ This should never
happen, but it does happen because anvi'o is not as good as it should be. We
opened an issue here: #565, and we are
determined to work on this. If this is like mid-2018 and you run into this error
please find an anvi'o developer and make them feel embarrassed. If it is earlier
than that, please let us know about this, and we will tell you about what you
should be careful about your downstream analyses to be perfect given the best we
know at the time. This is a very minor issue due to on-the-fly addition of
Ribosomal RNA gene calls to the contigs database, and will likely will not
affect anything major. But still. Get in touch with us if you have any
questions.

I am not totally sure what this means. Did the ribosomal RNA hits write out blank sequences for their gene calls, maybe?

As you kindly offer in the warning, what should I keep an eye on following this?

Thanks!
-mike

meren added a commit that referenced this issue Aug 12, 2017

Member Author

meren commented Aug 12, 2017

I updated the warning so it is less concerning. The blank gene sequences come from the Ribosomal RNA gene calls that are added to the contigs database on the fly. The problem is that anvi'o does not yet handle a contigs database with multiple gene callers consistently throughout the entire codebase.

The worst case scenario I can think of is a user ending up with a FASTA file with blank sequences. It should never happen within the normal modes of operation, but because the flexibility of anvi'o lets anyone leave the boundaries of our imagination, it can happen.

The warning is a reminder that there is a pebble in our shoe we want to get rid of when we have some time to stop walking. Nothing important, and will not screw up anything.
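
For anyone who does end up with the worst case described above, a FASTA file containing blank sequences, here is a minimal sketch of how one might detect and set aside the empty entries before downstream use. It is not part of anvi'o; the function names `parse_fasta` and `split_blank` are hypothetical, and the example records are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}; entries without sequence map to ''."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # join multi-line sequences
    return records


def split_blank(records: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (records with a sequence, headers whose sequence is blank)."""
    kept = {h: s for h, s in records.items() if s}
    blank = [h for h, s in records.items() if not s]
    return kept, blank


# Made-up example: entry "2" has a header but no sequence.
example = ">1\nATG\n>2\n>3\nGGC\nTTA\n"
kept, blank = split_blank(parse_fasta(example.splitlines()))
print(f"{len(kept)} entries with sequence, {len(blank)} blank: {blank}")
```

With a real file, the same functions can be fed an open file handle instead of `example.splitlines()`.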

Contributor

AstrobioMike commented Aug 12, 2017

jarrodscott commented May 8, 2018

Howdy all

Since this thread is mentioned in an anvi'o output that I am wondering about, I will ask my question here :). But I am happy to move it elsewhere... Maybe there is a very logical explanation for this, but just in case, I tried to include a lot of details.
See below for operating details.

I am trying to troubleshoot a problem I am having with a downstream analysis of a metagenomic pipeline (specifically here: http://merenlab.org/data/2017_Delmont_et_al_HBDs/#initial-automated-binning-with-concoct). An issue for another time...

I have been looking into my contigs databases at various steps of the pipeline and am curious whether I should be worried about the following for downstream analysis:

A. Initial gene calls using Prodigal: 284,562
B. Number of genes after HMMs: 284,613
C. Input for COG analysis (aa.sequences in /tmp/ directory): 284,579
D. Number of genes after COGs: 284,613

This is the warning I get from anvi'o when I run the anvi-run-ncbi-cogs command:

WARNING
===============================================
34 entries in the sequences table had blank sequences :/ This is related to the
issue at https://github.com/merenlab/anvio/issues/565. ...

My question is whether this discrepancy could cause problems in any downstream analysis. What's curious is that the difference in gene numbers before and after HMM profiling is 51, yet the gene file after running anvi-run-hmms contains only 34 blank FASTA entries. It looks like 51 FASTA entries are being added, 17 of which are not blank (coincidentally, that is half of 34?).
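
The arithmetic behind the numbers above can be checked with a short sketch. It uses only the counts reported in this comment. The last check is an interpretation that fits the numbers but is not confirmed anywhere in the thread: the COG input appears to be every gene call except the blank ones.

```python
# Counts as reported in the comment above.
prodigal_calls = 284_562       # A: initial gene calls using Prodigal
calls_after_hmms = 284_613     # B and D: genes after HMMs and after COGs
cog_input_sequences = 284_579  # C: sequences written as input for the COG search
blank_sequences = 34           # from the anvi-run-ncbi-cogs warning

# Gene calls added by anvi-run-hmms on top of the Prodigal calls.
added_by_hmms = calls_after_hmms - prodigal_calls

# Added calls that do carry a sequence.
added_non_blank = added_by_hmms - blank_sequences

# Gene calls that have a sequence at all (interpretation, see lead-in).
calls_with_sequence = calls_after_hmms - blank_sequences

print(added_by_hmms, added_non_blank, calls_with_sequence)
print(calls_with_sequence == cog_input_sequences)
```

Under that reading, the gap between B and C is exactly the 34 blank entries, which would mean the COG step skipped the blank sequences rather than losing any genes.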

Here are the commands:
anvi-gen-contigs-database -f EPM.fa -o EPM-CONTIGS.db
anvi-run-hmms -c EPM-CONTIGS.db -T 20
anvi-run-ncbi-cogs -c EPM-CONTIGS.db -T 20 --cog-data-dir ~/dbs/cog_db/ --temporary-dir-path ~/dbs/cog_db/tmp_sets_EPM/ --search-with diamond --sensitive

To get these counts, I used anvi-export-table and anvi-get-sequences-for-gene-calls.
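
One way to see where the extra gene calls come from is to tally an exported gene calls table by gene caller. The sketch below is not an anvi'o tool. It assumes the export is tab-delimited and has a column named `source` that holds the gene caller; that column name is an assumption here, and the three example rows, including the source names, are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_gene_calls_by_source(tsv_text: str, source_column: str = "source") -> Counter:
    """Count rows per gene caller in a tab-delimited table with a header row.

    The default column name "source" is an assumption about the exported table.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[source_column] for row in reader)


# Made-up rows, not real anvi'o output.
example_tsv = (
    "gene_callers_id\tcontig\tsource\n"
    "0\tcontig_1\tprodigal\n"
    "1\tcontig_1\tprodigal\n"
    "2\tcontig_2\tRibosomal_RNAs\n"
)
counts = count_gene_calls_by_source(example_tsv)
for source, n in counts.most_common():
    print(f"{source}\t{n}")
```

For a real export, read the file contents into a string first and pass them in place of `example_tsv`.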

I tested on two builds and got the same results

This is a master build
Anvi'o version ...............................: 4-master, "rosalind"
Profile DB version ...........................: 26
Contigs DB version ...........................: 12
Pan DB version ...............................: 9
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2

This is a conda build from 1 week ago.
Anvi'o version ...............................: 4, "rosalind"
Profile DB version ...........................: 26
Contigs DB version ...........................: 12
Pan DB version ...............................: 9
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2

uname -a
Linux login-30-2.local 2.6.32-696.23.1.el6.centos.plus.x86_64 #1 SMP Wed Mar 14 11:51:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

mytluo commented Aug 16, 2018

Hello, I am here to paste this warning I got from running anvi-setup-ncbi-cogs, as it is mid-2018 :)

WARNING

6 entries in the sequences table had blank sequences :/ This is related to the
issue at #565. If this is like mid-2018
and you still get this warning, please find an anvi'o developer and make them
feel embarrassed. If it is earlier than take this as a simple warning that some
gene calls in your downstream analyses may have no sequences, and that's OK.
This is a very minor issue due to on-the-fly addition of Ribosomal RNA gene
calls to the contigs database, and will likely will not affect anything major.
This warning will go away when anvi'o can seamlessly work with multiple gene
callers (which we are looking forward to implement in the future).

Member Author

meren commented Aug 17, 2018

Thank you very much for the reminder, @mytluo :) We were recently reminded of this and are looking forward to addressing it once and for all :)
