-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing of external gene calls provided by --external-gene-calls
#374
Comments
Hi Patricio! Thank you very much for trying out the rc3! I am hoping to release the new stable version very soon, but clearly there are things to address. The reason you get an error at line 1259 is probably because of the I will make sure it is clear in the documentation, but anvi'o follows the convention of string indexing that is identical the way one does it in Python or C (so you should change your input file instead of the code to make it work with what you have right now). I.e., for a gene call that is like this 1 2 3
nt pos: 12345678901234567890123456789012
seq: NNNATGNNNNNNNNNNNNNNNNNTAGAAAAAA
|______ gene X _______| The I am not sure whether it is common to start from I asked @tdelmont many times, but the computer scientist inside him made him insist with the 0-index splicing of strings. Best, |
--external-gene-calls
The default behavior is now clarified in the tutorial: http://merenlab.org/2016/06/22/anvio-tutorial-v2/#external-gene-calls |
Thank you Meren for the clarification. All is in order. I also wonder about 1 being the start coordinate for all gene callers. I just think about the target audience being more biology inclined than computer science inclined. Thanks again. |
Thanks for letting me know! I am glad it is sorted out.
That is a fair point, if I hear one more complaint about this I will change the default behavior and issue a public apology :) Best, |
No no no no, no need to change it, or apologize for it :) 99.99% of people won't use an external caller. I only did it because I was being impatient and just ran prodigal on a chunked version of my contigs. I'm very happy that you provide a way of adding these gene calls to the contig db. |
I just want to make one comment here! I agree that it is fine to leave it 0-indexed, but I want to ask about the discrepancy in the start and stop positions. Most gene callers would call your example gene from positions 4 - 26. When I was writing a program to convert RAST to an anvio gene table, I just subtracted 1 from both to get a 0-index for both start and stop positions. But this didn't work because in reality, Anvio is looking for a 0-indexed 5' and a 1-indexed 3' (at least from what most gene callers would give you), that is 3-26. Why didn't I say 0-indexed start and 1-indexed stop? Because in the forward direction, the start is 0-indexed and in the reverse direction the stop is 0-indexed. A bit confusing for a lay person, I think. Thoughts? |
This comment is only anecdotal, but when importing external gene calls of collaborators I have had to subtract their start index by 1, but not the stop index. I'm not sure of the gene calling source |
I am also facing trouble when i try to import gene calls into anvio. I am not sure what to subtract or add to the start and stop positions in order to replicate the correct gene calls for anvio input. |
@seanmcallister, i've noticed this too. I think it's a convention from python splicing (and more languages I'm sure), where it's "up to but not including" the last index position. variable = "word" |
The problem is with importing and exporting. The conventions for gene notation indicate that the start position is the 1st nt, and that both start and end positions are included. And always base 1. So, a 9 aa coding peptide would have the coordinates 1-27. It doesn't really matter how you implement it internally, but when you export it, it will be source of errors. All plain text genome coordinate files are as I described above: gtf/gff, vcf, blast output files, etc |
I agree with @xvazquezc. The problem is that Anvio expects 0-27 for that gene, and not an all around 0-index shift of 0-26, which I think is a bit confusing, as apparently do others. Perhaps this warrants a clearer explanation in the docs? Or... |
It's just a question of the level of convenience or not for the most common user. When i want to cut stuff out of actual sequences, I need to alter the "standard" coordinates to work appropriately with splicing. I happen to think that's annoying. But I'm sure Meren would be happy to change the code-base to do 1-X inclusive |
I have no intention to change it mostly because (1) it is working, (2) it's been working like this for a long time so there is that kind of dependency now compared to a year ago, (3) the way it works is documented clearly in the docs, and (4) there are many other things to do in the codebase before changing working things for convenience. It sounds like it is a general problem under this thread, but I'd like to remind you that we don't hear from others who read the documentation and used it without any issue. Changing this would require updating the codebase, writing a database migration script, and updating the documentation, and finding anyone who is now using this framework the way it is described in the documentation in their workflows and giving them a heads-up. Anyone who is importing and exporting gene calls is going to use a program to parse them from one format to another, and it is not a big deal to subtract one from or add one to the start position of every gene call. |
i was hoping i was wrong about that 🤣 |
I agree with @xvazquezc and @seanmcallister and I want to add that when a gene is in reverse order then the start position in the above example (i.e. a 9 aa coding peptide) is 27. So I believe Anvio in that case would expect 1-28 (not 0-27) in order to replicate the correct gene. This significantly complicates things. If I am wrong please correct me. |
Sorry for the confusion, @pbravakos. But gene |
Hello,
This is for version 2.0.0rc3 (83dac84).
When I try to run
anvi-gen-contigs-dabatase
using a table from an external gene caller, it fails withThe gene calls seem to be ok, similar to the example given in the help section, with lengths divisible by 3. The relevant code is in dbops.py:
If
sequence
is a string, then the index is zero-based, as opposed to the 1-based gene calls. Shifting the gene start position in the gene call file by one (by substraction) seems to work. Shifting the index start togene_call['start'] - 1
gets rid of that error, but an index out of range error appears coming from line 1259 in dbops.py.I assume this is a problem since gene coordinates are mostly 1-based.
Can you please take a look?
Thanks,
Patricio
The text was updated successfully, but these errors were encountered: