Parsing of external gene calls provided by `--external-gene-calls` #374

Closed
pjeraldo opened this Issue Jun 28, 2016 · 8 comments

pjeraldo commented Jun 28, 2016

Hello,

This is for version 2.0.0rc3 (83dac84).

When I try to run anvi-gen-contigs-database using a table from an external gene caller, it fails with:

Config Error: This sequence does not have proper number of nucleotides to be translated :/

The gene calls seem to be ok, similar to the example given in the help section, with lengths divisible by 3. The relevant code is in dbops.py:

sequence = contig_sequences[contig_name][gene_call['start']:gene_call['stop']]

If sequence is a string, then the index is zero-based, as opposed to the 1-based gene calls. Shifting the gene start position in the gene call file by one (by subtraction) seems to work. Shifting the slice start in the code to gene_call['start'] - 1 instead gets rid of that error, but then an index out of range error is raised from line 1259 in dbops.py.
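
A minimal sketch of the mismatch described here, using a toy contig and made-up coordinates (this is not anvi'o's actual code, only the same kind of slice as the dbops.py line quoted above):

```python
# Toy contig: 2 flanking bases, a 9-nt gene (ATGAAATAG), 3 flanking bases.
contig = "CC" + "ATGAAATAG" + "GGG"

# A typical gene caller reports this gene with 1-based, inclusive coordinates.
start_1based, stop_1based = 3, 11
reported_length = stop_1based - start_1based + 1   # 9, divisible by 3

# Using those numbers unchanged as a Python slice drops the first base.
naive = contig[start_1based:stop_1based]

# Shifting only the start by one recovers the full gene.
fixed = contig[start_1based - 1:stop_1based]

print(naive, len(naive))
print(fixed, len(fixed))
```

The naive slice is one nucleotide short, so its length is no longer a multiple of 3 even though the reported gene length is.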

I assume this is a problem since gene coordinates are mostly 1-based.

Can you please take a look?

Thanks,
Patricio

meren Jun 28, 2016

Member

Hi Patricio!

Thank you very much for trying out the rc3! I am hoping to release the new stable version very soon, but clearly there are things to address.

The reason you get an error at line 1259 is probably because of the gene_call['start'] - 1. The code should never branch out all the way there.

I will make sure it is clear in the documentation, but anvi'o follows a string indexing convention identical to the one used in Python or C (so you should change your input file instead of the code to make it work with what you have right now).

I.e., for a gene call that is like this

                 1         2         3
nt pos: 12345678901234567890123456789012
   seq: NNNATGNNNNNNNNNNNNNNNNNTAGAAAAAA
           |______ gene X _______|

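The start and stop should be 3 and 26: the start is the 0-based index of the first nucleotide, and the stop is the index just past the last one. This is just to have a standard that makes sense.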
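
Read as a Python slice, the diagram works out as in the sketch below. The sequence is rebuilt piecewise to match the 32-nt layout of the diagram; the variable names are only illustrative.

```python
# The 32-nt example from the diagram, built piecewise so the layout is explicit.
seq = "NNN" + "ATG" + "N" * 17 + "TAG" + "AAAAAA"

start, stop = 3, 26          # the coordinates given for gene X
gene_x = seq[start:stop]     # 0-based start, stop itself is excluded

print(gene_x[:3], gene_x[-3:])   # prints: ATG TAG
```

A caller that counts from 1 and includes both ends would report the same gene as 4 to 26.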

I am not sure whether it is common to start from 1 for all gene callers. If that is the absolute consensus maybe anvi'o can follow that.

I asked @tdelmont many times, but the computer scientist inside him made him insist on the 0-indexed slicing of strings.

Best,

meren added a commit to merenlab/web that referenced this issue Jul 1, 2016

meren changed the title from "external gene calls appear to be improperly parsed" to "Parsing of external gene calls provided by `--external-gene-calls`" Jul 1, 2016

meren added the design label Jul 1, 2016

meren Jul 1, 2016

Member

The default behavior is now clarified in the tutorial: http://merenlab.org/2016/06/22/anvio-tutorial-v2/#external-gene-calls

pjeraldo Jul 1, 2016

Thank you Meren for the clarification. All is in order.

I also wonder about 1 being the start coordinate for all gene callers. I just think about the target audience being more biology inclined than computer science inclined.

Thanks again.

meren Jul 1, 2016

Member

Thanks for letting me know! I am glad it is sorted out.

> the target audience being more biology inclined than computer science inclined.

That is a fair point. If I hear one more complaint about this, I will change the default behavior and issue a public apology :)

Best,

pjeraldo Jul 1, 2016

No no no no, no need to change it, or apologize for it :) 99.99% of people won't use an external caller. I only did it because I was being impatient and just ran prodigal on a chunked version of my contigs. I'm very happy that you provide a way of adding these gene calls to the contig db.

meren closed this Nov 4, 2016

seanmcallister Oct 2, 2018

I just want to make one comment here! I agree that it is fine to leave it 0-indexed, but I want to ask about the discrepancy in the start and stop positions. Most gene callers would call your example gene from positions 4-26. When I was writing a program to convert RAST to an anvi'o gene table, I just subtracted 1 from both to get a 0-index for both start and stop positions. But this didn't work because in reality, anvi'o is looking for a 0-indexed 5' and a 1-indexed 3' (at least relative to what most gene callers would give you), that is, 3-26. Why didn't I say 0-indexed start and 1-indexed stop? Because in the forward direction the start is 0-indexed, and in the reverse direction the stop is 0-indexed. A bit confusing for a lay person, I think. Thoughts?
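
The rule described in this comment can be sketched as a small converter. This is a hypothetical helper, not part of anvi'o, and it rests on two assumptions: the source caller reports 1-based, inclusive coordinates with start > stop for reverse-strand genes, and the target table wants start < stop plus a separate direction value ('f' or 'r').

```python
def to_zero_based_half_open(start, stop):
    """Convert 1-based, inclusive gene coordinates to 0-based start / exclusive stop.

    Hypothetical helper, not part of anvi'o. Assumes reverse-strand genes
    arrive with start > stop; returns (start, stop, direction) with
    start < stop, where 'f' is forward and 'r' is reverse.
    """
    direction = "f" if start <= stop else "r"
    left, right = min(start, stop), max(start, stop)
    # Only the leftmost coordinate shifts; the rightmost keeps its value.
    return left - 1, right, direction


print(to_zero_based_half_open(4, 26))   # forward-strand gene
print(to_zero_based_half_open(26, 4))   # same span on the reverse strand
```

Both calls give the span 3 to 26, which is why subtracting 1 from both reported coordinates does not work: only the smaller coordinate moves, whichever column it came from.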

ekiefl Oct 2, 2018

Contributor

This comment is only anecdotal, but when importing external gene calls from collaborators I have had to subtract 1 from their start index, but not from the stop index. I'm not sure which gene caller they came from.

pbravakos Oct 8, 2018

I am also having trouble when I try to import gene calls into anvi'o. I am not sure what to subtract from or add to the start and stop positions to produce the correct gene calls for anvi'o input.
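
One way to test a candidate adjustment before importing is to check each converted call against its contig. The function below is a hypothetical pre-import check, not anvi'o code; it assumes coordinates are already in the 0-based start, exclusive stop form discussed in this thread, and that the call is a complete gene (partial calls may legitimately fail the multiple-of-3 test).

```python
def check_gene_call(contig_seq, start, stop):
    """Return a list of problems for a 0-based start / exclusive stop gene call.

    Hypothetical pre-import check, not anvi'o code. Intended for complete
    genes only.
    """
    problems = []
    if not 0 <= start < stop <= len(contig_seq):
        problems.append("coordinates fall outside the contig")
    if (stop - start) % 3 != 0:
        problems.append("length is not a multiple of 3")
    return problems


contig = "CC" + "ATGAAATAG" + "GGG"      # toy contig with a 9-nt gene
print(check_gene_call(contig, 2, 11))    # correctly converted call
print(check_gene_call(contig, 3, 11))    # 1-based start left unconverted
```

An empty list means the call passes both checks; a systematic "not a multiple of 3" result across a whole table usually points to an off-by-one in one of the two columns.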
