funannotate predict fails suddenly at diamond/exonerate step #503

Closed
estolle opened this issue Oct 17, 2020 · 14 comments

@estolle

estolle commented Oct 17, 2020

Are you using the latest release?
v1.8.1 in conda env, one downgraded library (perl DB fix) "libdb=6.1.26"

Describe the bug
At the predict step, funannotate fails during or just after aligning proteins with diamond/exonerate. It complains that predict_misc/protein_alignments.gff3 does not exist. This is strange because over the last couple of days (before downgrading to libdb=6.1.26) I used the same set of proteins and the same conda env to successfully annotate some other genomes. These genomes are much more contiguous, with few chromosome-sized sequences (up to 16 Mb). It runs for quite a while, so it is probably busy aligning. The current directory has the tmp folder "p2g_9595", which contains only diamond.dmnd, an empty "diamond.matches.tab", and the empty folders "failed" and "scaffolds".

What command did you issue?

export OPENBLAS_NUM_THREADS=1
(
funannotate predict \
  -i $ASM \
  -o $OUT \
  -s "$SPECIES" \
  --name $NAME"_" \
  --busco_seed_species bombus_impatiens1 \
  --busco_db hymenoptera \
  --cpus $CPUs \
  --keep_no_stops \
  --keep_evm \
  --SeqCenter XXXX \
  --organism other \
  --max_intronlen 10000 \
  --optimize_augustus \
  --ploidy 1 \
  --repeat_filter overlap blast \
  --soft_mask 5000 \
  --protein_evidence \
    $INPUT2 \
    $INPUT4 \
    $INPUT5 \
    $INPUT6 \
    $INPUT7 \
    $INPUT8 \
    $INPUT9 \
    $INPUT10 \
    $INPUT11 \
    $INPUT12 \
    $INPUT13 \
    $INPUT14 \
  --transcript_evidence \
    $INPUT16 \
    $INPUT17
)

Logfiles
cat funannotate-p2g.log
[10/16/20 22:00:50]: /home/ek/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-p2g.py -p 2020_10_16_TetragonulaCarbonaria/predict_misc/proteins.combined.fa -g /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/genome.softmasked.fa -o 2020_10_16_TetragonulaCarbonaria/predict_misc/protein_alignments.gff3 --maxintron 10000 --cpus 20 --exonerate_pident 80 --ploidy 1 -f diamond --tblastn_out 2020_10_16_TetragonulaCarbonaria/predict_misc/p2g.diamond.out --logfile 2020_10_16_TetragonulaCarbonaria/logfiles/funannotate-p2g.log

[10/16/20 22:00:51]: Mapping 754,312 proteins to genome using diamond and exonerate
[10/16/20 22:00:51]: Diamond v2.0.4; Exonerate v2.4.0
[10/16/20 22:00:51]: diamond makedb --threads 8 --in /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/proteins.combined.fa --db diamond
[10/16/20 22:00:57]: diamond blastx --threads 8 -q /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/genome.softmasked.fa --db diamond -o diamond.matches.tab -e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe
[10/17/20 06:43:09]: CMD ERROR: diamond blastx --threads 8 -q /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/genome.softmasked.fa --db diamond -o diamond.matches.tab -e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe

OS/Install Information
Ubuntu 16.04

Error message
[10:00 PM]: Mapping 754,312 proteins to genome using diamond and exonerate
[06:43 AM]: CMD ERROR: diamond blastx --threads 8 -q /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/genome.softmasked.fa --db diamond -o diamond.matches.tab -e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe
b'diamond v2.0.4.142 (C) Max Planck Society for the Advancement of Science\nDocumentation, support and updates available at http://www.diamondsearch.org\n\n#CPU threads: 8\nScoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)\nTemporary directory: \nOpening the database... [0.159s]\n#Target sequences to report alignments for: unlimited\nReference = diamond.dmnd\nSequences = 754312\nLetters = 341564446\nBlock size = 2000000000\nOpening the input file... [
....
]\nSearching alignments... [7.341s]\nProcessing query block 1, reference block 1/1, shape 16/16, index chunk 4/4.\nBuilding reference seed array... [1.046s]\nBuilding query seed array... [1.506s]\nComputing hash join... [0.695s]\nBuilding seed filter... [0.085s]\nSearching alignments... [7.513s]\nDeallocating buffers... [0.078s]\nClearing query masking... [0.234s]\nComputing alignments... '
Traceback (most recent call last):
  File "/home/ek/.conda/envs/funnotate181/bin/funannotate", line 688, in <module>
    main()
  File "/home/ek/.conda/envs/funnotate181/bin/funannotate", line 678, in main
    mod.main(arguments)
  File "/home/ek/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/predict.py", line 983, in main
    lib.exonerate2hints(Exonerate, hintsP)
  File "/home/ek/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/library.py", line 3877, in exonerate2hints
    with open(file, 'r') as input:
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/protein_alignments.gff3'

@estolle
Author

estolle commented Oct 17, 2020

seems to be the same issue as #495

I am running the diamond command separately now:
diamond blastx --threads 80 -q /scratch/ek/stingless.bee.genomics/annotation/funannotate/TetragonulaCarbonaria/2020_10_16_TetragonulaCarbonaria/predict_misc/genome.softmasked.fa --db diamond -o diamond.matches.tab -e 1e-10 -k 0 --more-sensitive -f 6 sseqid slen sstart send qseqid qlen qstart qend pident length evalue score qcovhsp qframe

I may also try a new annotation on one of the genome.fa files that worked 2 days ago.

@estolle
Author

estolle commented Oct 17, 2020

running the diamond alignment on its own gives a segfault:

Building query seed array... [0.42s]
Computing hash join... [0.215s]
Building seed filter... [0.038s]
Searching alignments... [1.321s]
Deallocating buffers... [0.096s]
Clearing query masking... [0.285s]
Computing alignments...
Segmentation fault (core dumped)

I'm not sure how to proceed from here, i.e. how to find out where the problem actually is.

@nextgenusfs
Owner

Okay, so it must be a problem with that version of diamond. Can you try installing a different version? You could test with just that command outside of funannotate. It could also be that the number of threads is too high; try something like 32 and see if it still segfaults.

@nextgenusfs
Owner

Oh, never mind the threads, looks like it ran 8 within funannotate. So just try to downgrade a version or two:

diamond                        2.0.0      h31d8819_0  bioconda            
diamond                        2.0.1      h31d8819_0  bioconda            
diamond                        2.0.2      h31d8819_0  bioconda            
diamond                        2.0.4      h31d8819_0  bioconda    

@estolle
Author

estolle commented Oct 20, 2020

A new version of diamond fixed the problem, provided I also use the -F 15 option (frameshift mode). The error was initially caused by a very large contig (26 Mb; the first 7 Mb alone also triggered the error).

I modified the p2g.py script accordingly and it successfully ran through.
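
To narrow a crash like this down to one contig, it helps to cut a test FASTA that contains only the start of the largest sequence. A minimal sketch in plain Python; the function names and file names are made up for illustration and are not part of funannotate or diamond:

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_longest_prefix(fasta_in, fasta_out, n_bases, width=60):
    """Write the first n_bases of the longest sequence in fasta_in to fasta_out."""
    header, seq = max(read_fasta(fasta_in), key=lambda rec: len(rec[1]))
    prefix = seq[:n_bases]
    # Use the first word of the header as the new sequence name
    name = (header.split() or ["contig"])[0]
    with open(fasta_out, "w") as out:
        out.write(f">{name}_first_{len(prefix)}\n")
        for i in range(0, len(prefix), width):
            out.write(prefix[i:i + width] + "\n")
    return header, len(prefix)
```

Calling `write_longest_prefix("genome.softmasked.fa", "prefix.fa", 7_000_000)` and pointing `diamond blastx -q` at `prefix.fa` gives a much smaller input for reproducing the segfault.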

I did this:

git clone https://github.com/bbuchfink/diamond.git
cd diamond
mkdir bin
cd bin
cmake ..
make -j8

mv $HOME/.conda/envs/funnotate181/bin/diamond $HOME/.conda/envs/funnotate181/bin/diamond.orig
cp $HOME/bin/diamond $HOME/.conda/envs/funnotate181/bin/diamond

cp $HOME/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-p2g.py $HOME/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-p2g.orig.py

nano $HOME/.conda/envs/funnotate181/lib/python3.7/site-packages/funannotate/aux_scripts/funannotate-p2g.py

search for blastx

add '--unal', '0', '-c1', '-F', '15',
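
For anyone making the same edit: the change amounts to extending the argument list that is passed to diamond blastx. A minimal sketch, assuming the script builds the command as a Python list; `build_blastx_cmd` and its parameters are hypothetical names, not the actual code in funannotate-p2g.py. The base arguments are copied from the logged command above.

```python
def build_blastx_cmd(genome, db, out, cpus=8, frameshift=True):
    """Return a diamond blastx argument list (illustrative helper only)."""
    cmd = [
        "diamond", "blastx",
        "--threads", str(cpus),
        "-q", genome,
        "--db", db,
        "-o", out,
        "-e", "1e-10",
        "-k", "0",
        "--more-sensitive",
    ]
    if frameshift:
        # Extra options from the workaround described above
        cmd += ["--unal", "0", "-c1", "-F", "15"]
    # Tabular output fields are kept last, as in the logged command
    cmd += [
        "-f", "6", "sseqid", "slen", "sstart", "send", "qseqid", "qlen",
        "qstart", "qend", "pident", "length", "evalue", "score",
        "qcovhsp", "qframe",
    ]
    return cmd


print(" ".join(build_blastx_cmd("genome.softmasked.fa", "diamond", "diamond.matches.tab")))
```

Placing the extra options before `-f` keeps them from being read as output field names.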

@nextgenusfs
Owner

Okay, thanks @estolle -- let's leave this open and continue to track issues. Unfortunately, there are various small issues with diamond that seem to change frequently depending on the version. If there were a version without problems, we could at least pin that specific version in the Conda recipe. Most recently I tried using the latest version, which worked fine in my local tests, but those were small test sets, so they didn't catch the error you found with a larger assembly.

The goal of this diamond search is just to "pre-screen" locations on the genome to run exonerate on, so maybe we should get the opinion of the diamond developer on what the best command would be. Frameshift search makes sense on the surface, but if it yields 5X more sites for exonerate that don't produce a useful alignment, it will add to the run time. I know the current command produces a lot of regions where exonerate is unable to find a protein alignment > 80% identity, but it's still many times faster than using tblastn, which has had multiprocessing broken for like 6 years.

@estolle
Author

estolle commented Oct 22, 2020

It runs nicely as far as I can tell. It seems to find fewer preliminary alignments (unless that's specific to these species) and is therefore actually faster.

Species 1 which previously failed:
[09:37 PM]: Mapping 754,312 proteins to genome using diamond and exonerate
[10:16 PM]: Found 815,045 preliminary alignments --> aligning with exonerate
[11:59 PM]: Exonerate finished: found 87,154 alignments

Species 2 which previously failed:
[03:08 PM]: Mapping 754,312 proteins to genome using diamond and exonerate
[03:39 PM]: Found 766,336 preliminary alignments --> aligning with exonerate
[05:40 PM]: Exonerate finished: found 82,291 alignments

Species 3 with fragmented assembly which previously worked:
[12:25 AM]: Mapping 754,312 proteins to genome using diamond and exonerate
[12:57 AM]: Found 2,341,552 preliminary alignments --> aligning with exonerate
[05:33 AM]: Exonerate finished: found 108,238 alignments

Species 4 which previously worked (also fragmented assembly):
[12:26 AM]: Mapping 754,312 proteins to genome using diamond and exonerate
[01:02 AM]: Found 4,527,214 preliminary alignments --> aligning with exonerate
[06:05 AM]: Exonerate finished: found 117,572 alignments

Species 5 (fragmented, worked previously):
[03:55 PM]: Mapping 754,312 proteins to genome using diamond and exonerate
[04:32 PM]: Found 1,291,597 preliminary alignments --> aligning with exonerate
[05:46 PM]: Exonerate finished: found 93,715 alignments
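
The stage durations and the share of preliminary hits that survive exonerate can be read off these log lines with a little arithmetic. A rough sketch; the numbers are copied from the logs above, the helper name is made up, and it assumes each stage finishes within 24 hours because the log times carry no date:

```python
from datetime import datetime, timedelta


def elapsed_minutes(start, end):
    """Minutes between two 12-hour clock times, assuming under 24 h apart."""
    fmt = "%I:%M %p"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:  # the stage crossed midnight
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds() // 60)


# (start, diamond done, exonerate done, preliminary alignments, final alignments)
runs = {
    "species 1": ("09:37 PM", "10:16 PM", "11:59 PM", 815045, 87154),
    "species 2": ("03:08 PM", "03:39 PM", "05:40 PM", 766336, 82291),
    "species 3": ("12:25 AM", "12:57 AM", "05:33 AM", 2341552, 108238),
    "species 4": ("12:26 AM", "01:02 AM", "06:05 AM", 4527214, 117572),
    "species 5": ("03:55 PM", "04:32 PM", "05:46 PM", 1291597, 93715),
}

for name, (t_start, t_prelim, t_done, prelim, final) in runs.items():
    diamond_min = elapsed_minutes(t_start, t_prelim)
    exonerate_min = elapsed_minutes(t_prelim, t_done)
    kept = 100 * final / prelim
    print(f"{name}: diamond {diamond_min} min, exonerate {exonerate_min} min, "
          f"{kept:.1f}% of preliminary alignments kept")
```

For species 1 this reports 39 minutes for the diamond stage and 103 minutes for exonerate.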

@nextgenusfs
Owner

Great, thanks for the follow-up @estolle. I'd be okay with updating the diamond command, but it sounds like we have to wait for the next release in conda. That would also mean I'd have to tag it diamond >=2.0.5, which could potentially have some unintended consequences. Alternatively, we could modify the diamond command based on the version, i.e. if >= 2.0.5 then run in frameshift mode, which would let us be more lenient about versions in the conda recipe.

@BenjaminSchwessinger

Can you clarify which diamond version you recommend now? I ran into the same issue. I installed 2.0.5 via conda. Also, is there a way to have diamond run with the number of cpus given in the funannotate predict --cpus option? Right now it seems to default to 8 even if I give it 20.

nextgenusfs pushed a commit that referenced this issue Nov 24, 2020
@nextgenusfs
Owner

nextgenusfs commented Nov 24, 2020

Hi @BenjaminSchwessinger, @estolle, and @AMBedoyaO. The latest commit should default to frameshift mode, as described above by @estolle, if diamond >= 2.0.5. The cpu limit was an attempt to stop the RAM issue with diamond; I've also restored it to respect the --cpus argument if diamond >= 2.0.5. It seems to run a lot faster. Install the master and see if that fixes it:

python -m pip install --no-deps --force git+https://github.com/nextgenusfs/funannotate.git
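
The version-dependent behaviour described in this comment can be sketched roughly as follows. This is not the code from the commit; `parse_version` and `diamond_settings` are hypothetical helpers, and the 8-thread cap for older versions is an assumption based on the behaviour reported earlier in this thread.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from text such as 'diamond version 2.0.5'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def diamond_settings(version_text, requested_cpus, fallback_cpus=8):
    """Return (threads, extra_args) for a given diamond version string."""
    if parse_version(version_text) >= (2, 0, 5):
        # Newer diamond: honour --cpus and switch on frameshift mode
        return requested_cpus, ["--unal", "0", "-c1", "-F", "15"]
    # Older diamond: cap the thread count and keep the original command
    return min(requested_cpus, fallback_cpus), []


print(diamond_settings("diamond version 2.0.5", 20))
print(diamond_settings("Diamond v2.0.4", 20))
```

Comparing integer tuples rather than strings keeps a version such as 2.0.10 ordered after 2.0.5.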

@BenjaminSchwessinger

Great thanks. Test running now.

@tanoramb

'--unal', '0', '-c1', '-F', '15',

Thank you @estolle ! Your solution works smoothly!

@estolle
Author

estolle commented Feb 12, 2021

Hi again

Apparently the developer of diamond just fixed the segfault issue in the latest diamond release (v2.0.7), so the actual underlying problem should be fixed now; the frameshift mode was only a workaround. See bbuchfink/diamond#399 (comment).

I have not tested this yet, though.

@nextgenusfs
Owner

Cool. I'm fine with leaving in frameshift as I think the search is faster and it is doing a very similar thing.
