Funannotate exits in predict: bedtools intersect error #522
Comments
Hi @felipe797. Thanks for reporting. So you are saying that if we just add If you wanted to skip that step, you could pass
No, if you use
I sorted the files manually using unix
Right. So let me look at it a little closer, but one issue might be the GFF3 parser after this -- there isn't a way to maintain the GFF3 structure (gene --> mRNA --> exon --> CDS) with using either
That is true.
Yeah, it might be slower with interlap. I just made it generate sorted input for the repeat filtering step, as I'm not using those outputs directly. I'm testing locally and then I can push those changes. I can also write a "proper" sort in Python that keeps the desired GFF3 structure. So do you know if it then also dies on the tRNA step? Or, I presume, you likely haven't gotten there yet because it died on the repeat filtering.
Yeah, I didn't get there.
Okay, here is a quick Python function I'll use to sort properly. I don't think this will break bedtools sorting, as it just uses the feature type as a tiebreaker and sorts those in the order I want them:
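A minimal sketch of this kind of sort, not the function posted in this thread: records are ordered by contig and start coordinate, and the feature type is used only as a tiebreaker so that a parent lands before its children when they share a start. The names `FEATURE_RANK`, `gff3_sort_key`, and `sort_gff3_lines` and the ranking itself are hypothetical, chosen for illustration.

```python
# Hypothetical ranking: parent features sort before their children
# when two records share the same contig and start coordinate.
FEATURE_RANK = {"gene": 0, "mRNA": 1, "tRNA": 1, "exon": 2, "CDS": 3}


def gff3_sort_key(line):
    """Build a (contig, start, feature rank) key from one GFF3 record line."""
    cols = line.rstrip("\n").split("\t")
    contig, feature, start = cols[0], cols[2], int(cols[3])
    # Unknown feature types sort after all known ones.
    return (contig, start, FEATURE_RANK.get(feature, len(FEATURE_RANK)))


def sort_gff3_lines(lines):
    """Return header/comment lines first, then records in sorted order."""
    header = [l for l in lines if l.startswith("#")]
    records = [l for l in lines if l.strip() and not l.startswith("#")]
    return header + sorted(records, key=gff3_sort_key)
```

Because contig and start are still the leading sort keys, the output stays position-sorted; the feature rank only decides the order among records that start at the same coordinate.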
This looks great.
Not sure if gffutils has a sort feature like this too?
@hyphaltip I'm still testing here; it didn't like my first attempt. But if it works, I think I'll use this sorting on all incoming GFFs as well, as sometimes that can be one of the issues. I'll tag this issue when I push the working code.
We were going to try a larger memory allocation for the job, but it seems like if we can solve this with sorted input it would be more efficient?
Okay, this passed the test data. Added
I updated on the system @felipe797 so you can re-try and see if it behaves better?
@hyphaltip still giving the same error for the two files. I did a little investigation on the
while
Which is essentially the same as the file I was having trouble with before. |
Hmm, I used python We can maybe be more explicit with the
In this case I think what's causing trouble is the start position of the contigs. It's probably expecting upstream contigs to come first in the file (i.e., scaffold 99's CDS starts at 115720 and scaffold 558's at 6650, so the latter should come first).
I don't think so, its
@hyphaltip try that latest, tests pass locally. Now using
@nextgenusfs I gave it another try with the updated version today and it worked perfectly.
Hi Jon!
First of all, thanks for the pipeline, it's been really useful to me!
So, I ran into this problem in predict, where it's quitting after a bedtools intersect error. From logfile:
[09:39 AM]: Generating protein fasta files from 22,573 EVM models
[09:39 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[09:39 AM]: CMD ERROR: bedtools intersect -f 0.9 -a /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/evm.round1.gff3 -b /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/repeatmasker.bed
[09:39 AM]: (None, b'ERROR: Received illegal bin number -1 from getBin call.\nMaximum values is: 2396745\nThis typically means that your coordinates are\nnegative or too large to represent in the data\nstructure bedtools uses to find intersections.ERROR: Unable to add record to tree.\n')
I'm running the new versions of both funannotate (v1.8.2) and bedtools (v2.29.2).
After some troubleshooting, I found out bedtools intersect has problems with large scaffolds and the recommendation is, according to their readthedocs:
I proceeded to sort both my repeatmasker.bed and evm.round1.gff3 and ran bedtools intersect manually on the new sorted files, using the same command funannotate predict did, and the same error popped up. But it ran smoothly with the same command plus the -sorted flag on the sorted files:
bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed -sorted
Not sure, but maybe a suggestion would be a conditional statement that checks whether the input files are sorted by chromosome and start position and, if so, runs the command with -sorted?
in funannotate/library.py, line 5656:
    def validate_tRNA(input, genes, gaps, output):
        cmd = ['bedtools', 'intersect', '-v', '-a', input, '-b', genes]
        if gaps:
            cmd.append(gaps)
        runSubprocess2(cmd, '.', log, output)
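To make the suggestion concrete, here is a rough sketch of such a check, not code from funannotate. The helpers `is_position_sorted` and `build_intersect_cmd` are hypothetical names. The sketch assumes tab-delimited input, that "sorted" means non-decreasing by (chromosome, start) with chromosome names compared as plain strings (as `sort -k1,1 -k2,2n` would order them in the C locale), and that both files use the same chromosome naming.

```python
def is_position_sorted(path, chrom_col=0, start_col=1):
    """Return True if records are non-decreasing by (chromosome, start)."""
    previous = None
    with open(path) as handle:
        for line in handle:
            # Skip blank lines, comments, and BED track/browser headers.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            cols = line.rstrip("\n").split("\t")
            current = (cols[chrom_col], int(cols[start_col]))
            if previous is not None and current < previous:
                return False
            previous = current
    return True


def build_intersect_cmd(gff3, bed):
    """Build the bedtools command, adding -sorted only when both inputs qualify."""
    cmd = ["bedtools", "intersect", "-f", "0.9", "-a", gff3, "-b", bed]
    # GFF3 keeps the start in column 4 (index 3); BED keeps it in column 2 (index 1).
    if is_position_sorted(gff3, start_col=3) and is_position_sorted(bed, start_col=1):
        cmd.append("-sorted")
    return cmd
```

The returned list could then be handed to whatever runs the subprocess. If either file fails the check, the command falls back to the default (unsorted) bedtools behaviour rather than failing.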
Let me know. Meanwhile, can you think of an alternative way to get around this without editing source? Thank you so much in advance!