Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing of external gene calls provided by --external-gene-calls #374

Closed
pjeraldo opened this issue Jun 28, 2016 · 16 comments
Closed

Parsing of external gene calls provided by --external-gene-calls #374

pjeraldo opened this issue Jun 28, 2016 · 16 comments
Labels

Comments

@pjeraldo
Copy link

Hello,

This is for version 2.0.0rc3 (83dac84).

When I try to run anvi-gen-contigs-dabatase using a table from an external gene caller, it fails with

Config Error: This sequence does not have proper number of nucleotides to be translated :/

The gene calls seem to be ok, similar to the example given in the help section, with lengths divisible by 3. The relevant code is in dbops.py:

sequence = contig_sequences[contig_name][gene_call['start']:gene_call['stop']]

If sequence is a string, then the index is zero-based, as opposed to the 1-based gene calls. Shifting the gene start position in the gene call file by one (by substraction) seems to work. Shifting the index start to gene_call['start'] - 1 gets rid of that error, but an index out of range error appears coming from line 1259 in dbops.py.

I assume this is a problem since gene coordinates are mostly 1-based.

Can you please take a look?

Thanks,
Patricio

@meren
Copy link
Member

meren commented Jun 28, 2016

Hi Patricio!

Thank you very much for trying out the rc3! I am hoping to release the new stable version very soon, but clearly there are things to address.

The reason you get an error at line 1259 is probably because of the gene_call['start'] - 1. The code should never branch out all the way there.

I will make sure it is clear in the documentation, but anvi'o follows the convention of string indexing that is identical the way one does it in Python or C (so you should change your input file instead of the code to make it work with what you have right now).

I.e., for a gene call that is like this

                 1         2         3
nt pos: 12345678901234567890123456789012
   seq: NNNATGNNNNNNNNNNNNNNNNNTAGAAAAAA
           |______ gene X _______|

The start and stop should be 3 and 26. This is just to have a standard that would make sense

I am not sure whether it is common to start from 1 for all gene callers. If that is the absolute consensus maybe anvi'o can follow that.

I asked @tdelmont many times, but the computer scientist inside him made him insist with the 0-index splicing of strings.

Best,

@meren meren changed the title external gene calls appear to be improperly parsed Parsing of external gene calls provided by --external-gene-calls Jul 1, 2016
@meren meren added the design label Jul 1, 2016
@meren
Copy link
Member

meren commented Jul 1, 2016

The default behavior is now clarified in the tutorial: http://merenlab.org/2016/06/22/anvio-tutorial-v2/#external-gene-calls

@pjeraldo
Copy link
Author

pjeraldo commented Jul 1, 2016

Thank you Meren for the clarification. All is in order.

I also wonder about 1 being the start coordinate for all gene callers. I just think about the target audience being more biology inclined than computer science inclined.

Thanks again.

@meren
Copy link
Member

meren commented Jul 1, 2016

Thanks for letting me know! I am glad it is sorted out.

the target audience being more biology inclined than computer science inclined.

That is a fair point, if I hear one more complaint about this I will change the default behavior and issue a public apology :)

Best,

@pjeraldo
Copy link
Author

pjeraldo commented Jul 1, 2016

No no no no, no need to change it, or apologize for it :) 99.99% of people won't use an external caller. I only did it because I was being impatient and just ran prodigal on a chunked version of my contigs. I'm very happy that you provide a way of adding these gene calls to the contig db.

@meren meren closed this as completed Nov 4, 2016
@seanmcallister
Copy link

I just want to make one comment here! I agree that it is fine to leave it 0-indexed, but I want to ask about the discrepancy in the start and stop positions. Most gene callers would call your example gene from positions 4 - 26. When I was writing a program to convert RAST to an anvio gene table, I just subtracted 1 from both to get a 0-index for both start and stop positions. But this didn't work because in reality, Anvio is looking for a 0-indexed 5' and a 1-indexed 3' (at least from what most gene callers would give you), that is 3-26. Why didn't I say 0-indexed start and 1-indexed stop? Because in the forward direction, the start is 0-indexed and in the reverse direction the stop is 0-indexed. A bit confusing for a lay person, I think. Thoughts?

@ekiefl
Copy link
Contributor

ekiefl commented Oct 2, 2018

This comment is only anecdotal, but when importing external gene calls of collaborators I have had to subtract their start index by 1, but not the stop index. I'm not sure of the gene calling source

@pbravakos
Copy link

I am also facing trouble when i try to import gene calls into anvio. I am not sure what to subtract or add to the start and stop positions in order to replicate the correct gene calls for anvio input.

@AstrobioMike
Copy link
Contributor

@seanmcallister, i've noticed this too. I think it's a convention from python splicing (and more languages I'm sure), where it's "up to but not including" the last index position.
e.g. in python:

variable = "word"
variable[1:3]
would return "or"

@xvazquezc
Copy link
Contributor

The problem is with importing and exporting. The conventions for gene notation indicate that the start position is the 1st nt, and that both start and end positions are included. And always base 1. So, a 9 aa coding peptide would have the coordinates 1-27.

It doesn't really matter how you implement it internally, but when you export it, it will be source of errors. All plain text genome coordinate files are as I described above: gtf/gff, vcf, blast output files, etc

@seanmcallister
Copy link

I agree with @xvazquezc. The problem is that Anvio expects 0-27 for that gene, and not an all around 0-index shift of 0-26, which I think is a bit confusing, as apparently do others. Perhaps this warrants a clearer explanation in the docs? Or...

@AstrobioMike
Copy link
Contributor

It's just a question of the level of convenience or not for the most common user. When i want to cut stuff out of actual sequences, I need to alter the "standard" coordinates to work appropriately with splicing. I happen to think that's annoying. But I'm sure Meren would be happy to change the code-base to do 1-X inclusive

@meren
Copy link
Member

meren commented Oct 17, 2018

But I'm sure Meren would be happy to change the code-base

I have no intention to change it mostly because (1) it is working, (2) it's been working like this for a long time so there is that kind of dependency now compared to a year ago, (3) the way it works is documented clearly in the docs, and (4) there are many other things to do in the codebase before changing working things for convenience. It sounds like it is a general problem under this thread, but I'd like to remind you that we don't hear from others who read the documentation and used it without any issue.

Changing this would require updating the codebase, writing a database migration script, and updating the documentation, and finding anyone who is now using this framework the way it is described in the documentation in their workflows and giving them a heads-up.

Anyone who is importing and exporting gene calls is going to use a program to parse them from one format to another, and it is not a big deal to subtract one from or add one to the start position of every gene call.

@AstrobioMike
Copy link
Contributor

But I'm sure Meren would be happy to change the code-base

i was hoping i was wrong about that 🤣

@pbravakos
Copy link

I agree with @xvazquezc and @seanmcallister and I want to add that when a gene is in reverse order then the start position in the above example (i.e. a 9 aa coding peptide) is 27. So I believe Anvio in that case would expect 1-28 (not 0-27) in order to replicate the correct gene. This significantly complicates things. If I am wrong please correct me.

@meren
Copy link
Member

meren commented Oct 17, 2018

Sorry for the confusion, @pbravakos. But gene start and stop positions do not care about the direction of the gene as they simply address how the gene sequence should be sliced out from a longer sequence. Whether a gene is forward or reverse is defined in the column direction in the external gene calls file. I hope this helps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants