Funannotate exits in predict: bedtools intersect error #522

felipe797 · 2020-12-10T19:09:57Z

Hi Jon!
First of all, thanks for the pipeline, it's been really useful to me!

So, I ran into this problem in predict, where it's quitting after a bedtools intersect error. From logfile:

[09:39 AM]: Generating protein fasta files from 22,573 EVM models\n [09:39 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).\n [09:39 AM]: CMD ERROR: bedtools intersect -f 0.9 -a\n /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/evm.round1.gff3 -b /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/repeatmasker.bed [09:39 AM]: (None, b'ERROR: Received illegal bin number -1 from getBin call.\nMaximum values is: 2396745\nThis typically means that your coordinates are\nnegative or too large to represent in the data\nstructure bedtools uses to find intersections.ERROR: Unable to add record to tree.\n')

I'm running the new versions of both funannotate (v1.8.2) and bedtools (v2.29.2).
After some troubleshooting, I found out bedtools intersect has problems with large scaffolds and the recommendation is, according to their readthedocs:

Note

If you are trying to intersect very large files and are having trouble with excessive memory usage, please presort your data by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files) and then use the -sorted option. This invokes a memory-efficient algorithm designed for large files. This algorithm has been substantially improved in recent (>=2.18.0) releases.

I proceeded to sort both my repeatmasker.bed and evm.round1.gff3 and run bedtools intersect manually on the sorted new files, using the same command funannotate predict did and same error popped up :

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed        

ERROR: Received illegal bin number -1 from getBin call.
Maximum values is: 2396745
This typically means that your coordinates are
negative or too large to represent in the data
structure bedtools uses to find intersections.ERROR: Unable to add record to tree.

But it ran smoothly in the same command but with the flag -sorted with the sorted files:

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed -sorted

Not sure but maybe a suggestion would be conditional statement to check if infiles are sorted by chromosome and start position and if so run the command with -sorted?

in funannotate/library.py line 5656 :

def validate_tRNA(input, genes, gaps, output): cmd = ['bedtools', 'intersect', '-v', '-a', input, '-b', genes] if gaps: cmd.append(gaps) runSubprocess2(cmd, '.', log, output)

Let me know. Meanwhile, can you think of an alternative way to get around this without editing source? Thank you so much in advance!

The text was updated successfully, but these errors were encountered:

nextgenusfs · 2020-12-10T19:26:15Z

Hi @felipe797. Thanks for reporting. So you are saying that if we just add -sorted to the command that it will run? I can't remember off the top of my head if the inputs are sorted. One other option would be to not use bedtools to do this.

If you wanted to skip that step, you could pass --repeat_filter blast which will then skip this step of bedtools overlap filtering of repetitive gene models.

felipe797 · 2020-12-10T19:54:52Z

No, if you use -sorted when the files are not actually sorted it complains:

Error: Sorted input specified, but the file evm.round1.gff3~ has the following out of order record scaffold_26 EVM exon 3755 4003 . - . ID=evm.model.scaffold_26.3.exon2;Parent=evm.model.scaffold_26.3

I sorted the files manually using unix sort, but there's also sortBed option in bedtools.

nextgenusfs · 2020-12-10T19:58:09Z

Right. So let me look at it a little closer, but one issue might be the GFF3 parser after this -- there isn't a way to to maintain the GFF3 structure (gene --> mRNA --> exon --> CDS) with using either sort or sortBed, which then might cause some downstream problems. I could just not use bedtools to filter but rather use python interlap.

felipe797 · 2020-12-10T20:28:45Z

That is true.
I initially thought the problem was just the bed file, which would have been fine with -sort
I think interlap would do the job but probably be slower than bedtools for larger datasets ?

nextgenusfs · 2020-12-10T20:33:05Z

Yeah it might be slower with interlap. I just made it generated sorted input for the repeat filtering step as I'm not using those outputs directly, testing locally and then I can push those changes. I can also write a sorting "properly" with python and keep the desired GFF3 structure. So do you know if then it also dies on the tRNA step? Or likely you haven't gotten there yet I presume because it died on the repeat filtering.

felipe797 · 2020-12-10T20:59:55Z

Yeah, I didn't get there.
I'm running it now with blast in the repeat filtering as you suggested. I'll start an additional run it with the sorted files I generated just to see if there's any weird behaviour. Keep you updated.

nextgenusfs · 2020-12-10T21:04:52Z

Okay, here is a quick python function I'll use to sort properly, I don't think this will break bedtools sorting as it is just using as a tiebreaker the features and sorting those in the order I want them:

def sortGFFproper(input, output):
    # function to sort GFF3 file but maintain gene, mrna, exon, cds order
    data = []
    features = set()
    comments = []
    with open(input, 'r') as infile:
        for line in infile:
            if line.startswith('\n'):
                continue
            if line.startswith('#'):
                comments.append(line)
                continue
            line = line.rstrip()
            cols = line.split('\t')
            data.append(cols)
            features.add(cols[2])
    # build sort order dictionary for features
    order_map = {'gene': 0, 'mRNA': 1, 'transcript': 2, 'tRNA': 3, 'ncRNA': 4,
                 'rRNA': 5, 'pseudogene': 6, 'five_prime_utr': 7,
                 'five_prime_UTR': 8, 'exon': 9, 'CDS': 10,
                 'three_prime_utr': 11, 'three_prime_UTR': 12}
    idx = len(order_map)
    for x in features:
        if x not in order_map:
            order_map[x] = idx
            idx += 1
    # we can now sort
    sort_data = sorted(data, key=lambda x: (x[0], int(x[3]), order_map[x[2]]))
    # now we can write back out to file
    with open(output, 'w') as outfile:
        for y in comments:
            outfile.write(y)
        for x in sort_data:
            outfile.write('{}\n'.format('\t'.join(x)))

felipe797 · 2020-12-10T21:37:23Z

This looks great.
Thank you so much for the quick reply and solution.

hyphaltip · 2020-12-10T22:24:04Z

Not sure if gffutils has sort feature like this too?

nextgenusfs · 2020-12-10T22:25:57Z

@hyphaltip I'm still testing here, it didn't like my first attempt. But if works I think I'll use this sorting on all incoming GFFs as well as sometimes that can be one of the issues. I'll tag this issue when I push the working code.

hyphaltip · 2020-12-10T22:36:47Z

We were going to try a larger memory allocation for the job but it seems like if we can solve this with sorted it would be more efficient?

…522

nextgenusfs · 2020-12-10T23:20:08Z

Okay, this passed the test data. Added -sorted to both bedtools calls and sorting with the function above. Let me know if that reduces memory.

hyphaltip · 2020-12-13T03:37:26Z

I updated on the system @felipe797 so you can re-try and see if it behaves better?

felipe797 · 2020-12-13T21:03:37Z

@hyphaltip still giving the same error for the two files. I did a little investigation on the evm.round1.gff3 and it looks like it's still not sorted the way bedtools wants it.

tail -2 on the file I sorted manually before and bedtools intersect behaved on:

scaffold_99 EVM exon 115720 116052 . - . ID=evm.model.scaffold_99.34.exon1;Parent=evm.model.scaffold_99.34 scaffold_99 EVM CDS 115720 116052 . - 0 ID=cds.evm.model.scaffold_99.34;Parent=evm.model.scaffold_99.34

while tail -2 on the file generated on my current attempt:

scaffold_558 EVM exon 6650 11401 . + . ID=evm.model.scaffold_558.1.exon1;Parent=evm.model.scaffold_558.1 scaffold_558 EVM CDS 6650 11401 . + 0 ID=cds.evm.model.scaffold_558.1;Parent=evm.model.scaffold_558.1

Which is essentially the same as the file I was having trouble with before.
Maybe there's no way we can get the .gff3 to be sorted exactly the way bedtools expect without disrupting its structure.

nextgenusfs · 2020-12-13T21:11:14Z

Hmm, I used python sorted which should mimic unix sort -- I almost always prefer python natsort but in this case new that would change the scaffold order to what a human would expect (ie 99 before 558). I must not have had enough scaffolds in the test dataset to see this.

We can maybe be more explicit with the bedtools interest -sorted -g genome.file option. Or I can just unix sort them for this particular task.... not really sure why they would be different.

felipe797 · 2020-12-13T21:40:29Z

In this case I think what's causing trouble is the start position of the contigs. It's probably expecting upstream contigs to come before in the file (ie scaffold 99's CDS starts on 115720 and scaffold 558's on 6650, so the latter should come first).

nextgenusfs · 2020-12-13T21:43:03Z

I don't think so, its sort -k1,1 -k4,4n first, so that's the scaffolds first and then sorted by start position (which is what I did in python sort, but apparently it treats the scaffold names differently than unix/bedtools sort). It wants them in a specific order. One sec I'm testing just using bedtools sort and then converting back to the way in which I think it should be after. Should have a fix in a few minutes.

nextgenusfs · 2020-12-13T22:02:58Z

@hyphaltip try that latest, tests pass locally. Now using bedtools sort to sort for the intersection. For the tRNA step it does the same thing but then "properly" sorts those output after so we don't have potential issue with loading into the dictionary structure.

felipe797 · 2020-12-14T21:31:39Z

@nextgenusfs I gave it a another try with the updated version today and it worked perfectly.
Thank you!

nextgenusfs pushed a commit that referenced this issue Dec 10, 2020

add -sorted to bedtools commands/sort input prior to reduce mem usage #…

81bc950

…522

nextgenusfs pushed a commit that referenced this issue Dec 13, 2020

use bedtools sort for intersect files #522

05b1810

nextgenusfs closed this as completed Jan 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Funannotate exits in predict: bedtools intersect error #522

Funannotate exits in predict: bedtools intersect error #522

felipe797 commented Dec 10, 2020 •

edited

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020 •

edited

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020 •

edited

felipe797 commented Dec 10, 2020

hyphaltip commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

hyphaltip commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

hyphaltip commented Dec 13, 2020

felipe797 commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

felipe797 commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

felipe797 commented Dec 14, 2020

Funannotate exits in predict: bedtools intersect error #522

Funannotate exits in predict: bedtools intersect error #522

Comments

felipe797 commented Dec 10, 2020 • edited

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020 • edited

nextgenusfs commented Dec 10, 2020

felipe797 commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020 • edited

felipe797 commented Dec 10, 2020

hyphaltip commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

hyphaltip commented Dec 10, 2020

nextgenusfs commented Dec 10, 2020

hyphaltip commented Dec 13, 2020

felipe797 commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

felipe797 commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

nextgenusfs commented Dec 13, 2020

felipe797 commented Dec 14, 2020

felipe797 commented Dec 10, 2020 •

edited

felipe797 commented Dec 10, 2020 •

edited

nextgenusfs commented Dec 10, 2020 •

edited