
BLASTN causing crash/core dump with ~1% of samples (tested on 3.11.2 and 3.11.11) #118

Closed
dutchscientist opened this issue Apr 29, 2023 · 21 comments
Labels: bug (something isn't working), can't reproduce (bug or apparent bug that we can't yet reproduce)

Comments

@dutchscientist

I am running >20k Salmonella genomes with AMRfinder using the "--plus" switch and "-O Salmonella". In about 1% of the samples it will crash once the BLASTN starts for the point mutation search; if I run it without the -O switch, it is fine with the same sequences. I first thought it could have to do with long contig names, but after running them through Prokka with renamed contig names, it still causes failures.

Below is the output when crashing:

*** ERROR ***
'/home/username/mambaforge/envs/genotyping/bin/blastn' -query 'Salm000001fna/Salm000048.fna' -db /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella -evalue 1e-20 -dust no -max_target_seqs 10000 -num_threads 2 -mt_mode 1 -outfmt '6 qseqid sseqid qstart qend qlen sstart send slen qseq sseq' -out /tmp/amrfinder.4Qo99F/blastn > /tmp/amrfinder.4Qo99F/log 2> /tmp/amrfinder.4Qo99F/blastn-err
status = 35584
Segmentation fault (core dumped)

Anything that can be done for this? Thanks :)
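
For readers trying to interpret "status = 35584" in the output above: a minimal sketch, assuming the number is a raw wait status of the kind returned by system(), where the exit code of the shell that launched blastn sits in the high byte. The function name decode_wait_status is made up for this illustration. Under that assumption the value decodes to exit code 139, which is the shell convention of 128 plus signal 11 (SIGSEGV on Linux), consistent with the "Segmentation fault" message.

```shell
# Decode a status value such as the "status = 35584" line in the error output.
# Assumption: it is a raw wait status (exit code in the high byte,
# terminating signal of the direct child in the low 7 bits).
decode_wait_status() {
    status=$1
    exit_code=$(( (status >> 8) & 255 ))   # exit code of the shell that ran the command
    low=$(( status & 127 ))                # non-zero only if the direct child was killed by a signal
    if [ "$low" -ne 0 ]; then
        echo "killed by signal $low"
    elif [ "$exit_code" -gt 128 ]; then
        # Shell convention: 128 + N means the command died from signal N
        echo "exit code $exit_code (128 + signal $(( exit_code - 128 )))"
    else
        echo "exit code $exit_code"
    fi
}

decode_wait_status 35584
```

The same function can be used on any other status value that amrfinder reports, as long as the wait-status assumption holds.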

@vbrover
Contributor

vbrover commented Apr 29, 2023

If you run

'/home/username/mambaforge/envs/genotyping/bin/blastn' -query 'Salm000001fna/Salm000048.fna' -db /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella -evalue 1e-20 -dust no -max_target_seqs 10000 -num_threads 2 -mt_mode 1 -outfmt '6 qseqid sseqid qstart qend qlen sstart send slen qseq sseq' -out xxx

do you get the same crash?

What are the contents of the files below?

/tmp/amrfinder.4Qo99F/blastn 
/tmp/amrfinder.4Qo99F/log 
/tmp/amrfinder.4Qo99F/blastn-err

And what are the versions of amrfinder and blastn?

@vbrover
Contributor

vbrover commented Apr 29, 2023

What is the result of these commands?

ls -laF /home/username/mambaforge/envs/genotyping/bin/blastn
ls -laF Salm000001fna/Salm000048.fna
ls -laF /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella
ls -laF /tmp/amrfinder.4Qo99F/blastn 
ls -laF /tmp/amrfinder.4Qo99F/log 
ls -laF /tmp/amrfinder.4Qo99F/blastn-err

@dutchscientist
Author

Tried it with amrfinder 3.11.11 (Python 3.7) and 3.11.2 (Python 3.10). The BLAST version is BLAST 2.13.0+ in both cases, running on Ubuntu 22.04 LTS in two Mamba environments. The Database version used is: 2023-04-17.1

With the commandline suggestion, I still get "Segmentation fault (core dumped)"

blastn: "" (empty, 0 bytes)
log: "" (empty, 0 bytes)
blastn-err: "Segmentation fault (core dumped)"

As said, the weird thing is that it only happens in a minority of genomes.

@dutchscientist
Author

dutchscientist commented Apr 29, 2023

-rwxrwxr-x 4 vetschool vetschool 276776 Jul 19 2022 /home/vetschool/mambaforge/envs/genomics/bin/blastn*
-rw-rw-r-- 1 vetschool vetschool 5093967 Apr 27 19:06 Salm000001fna/Salm000048.fna
-rw-rw-r-- 1 vetschool vetschool 1612 Apr 21 00:31 /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella
-rw-rw-r-- 1 vetschool vetschool 0 Apr 29 18:03 /tmp/amrfinder.4Qo99F/blastn
-rw-rw-r-- 1 vetschool vetschool 0 Apr 29 18:03 /tmp/amrfinder.4Qo99F/log
-rw-rw-r-- 1 vetschool vetschool 33 Apr 29 18:03 /tmp/amrfinder.4Qo99F/blastn-err

(username = vetschool, genotyping is the env for Python 3.10 which only allows amrfinder 3.11.2, genomics is the env for Python 3.7 which allows amrfinder 3.11.11)

@vbrover
Contributor

vbrover commented Apr 29, 2023

Since the bug is reproducible, could you post Salm000001fna/Salm000048.fna?

Can you try BLASTN ver. 2.14.0+?

@dutchscientist
Author

Blast 2.14 is not available yet via Conda/Mamba?

The Salm000048.fna file is available on https://drive.google.com/file/d/11JmHcvVhjvgJz1JFxrokD7PIvyjOw3Rv/view?usp=sharing.

@dutchscientist
Author

Salm000048.zip

@vbrover
Contributor

vbrover commented Apr 29, 2023

I have tried blastn ver. 2.13.0+ and 2.14.0+, and both worked on Salm000048.fna with exit code 0 and an empty output file.

Let's check that the blast database is available. What is the result of this command?

ls -laF /tmp/amrfinder.4Qo99F/db/AMR_DNA-Salmonella*

Is there enough disk space?

@vbrover
Contributor

vbrover commented Apr 29, 2023

AMR_DNA-Salmonella*

@dutchscientist
Author

dutchscientist commented Apr 29, 2023

-rw-rw-r-- 1 vetschool vetschool 1612 Apr 26 17:20 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella
-rw-rw-r-- 1 vetschool vetschool 20480 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.ndb
-rw-rw-r-- 1 vetschool vetschool 117 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nhr
-rw-rw-r-- 1 vetschool vetschool 160 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nin
-rw-rw-r-- 1 vetschool vetschool 572 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.njs
-rw-rw-r-- 1 vetschool vetschool 20 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.not
-rw-rw-r-- 1 vetschool vetschool 386 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nsq
-rw-rw-r-- 1 vetschool vetschool 16384 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.ntf
-rw-rw-r-- 1 vetschool vetschool 8 Apr 26 17:21 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.nto
-rw-rw-r-- 1 vetschool vetschool 406 Apr 26 17:20 /tmp/amrfinder.Bvch0H/db/AMR_DNA-Salmonella.tab

I rebooted the computer and now Salm000048.fna did work, so I took the next one (Salm000070.fna), which does crash again; hence the change of the temporary directory code to Bvch0H.

The output of df -h is:
Filesystem         Size  Used Avail Use% Mounted on
tmpfs              4.8G  1.4M  4.8G   1% /run
/dev/sda1          473G  162G  287G  37% /
tmpfs               24G     0   24G   0% /dev/shm
tmpfs              5.0M  4.0K  5.0M   1% /run/lock
virtualbox_shared  7.3T  1.5T  5.9T  20% /media/sf_virtualbox_shared
tmpfs              4.8G  124K  4.8G   1% /run/user/1001
Plenty of space, >250 GB.

(virtualbox Ubuntu 22.04 LTS computer running in Windows)

@dutchscientist
Author

And now the next one works after a few tries. This is very irritating!

Thanks very much for your assistance, by the way!

@vbrover
Contributor

vbrover commented Apr 29, 2023

Your blastn has size 276776 whereas on my computer:

$ ls -laF blast/ncbi-blast-2.13.0+/bin/blastn
-rwxr-xr-x 1 brovervv pathogen 28839896 Feb  2  2022 blast/ncbi-blast-2.13.0+/bin/blastn*

@vbrover
Contributor

vbrover commented Apr 29, 2023

Are you working on a Windows computer emulating Ubuntu?

@dutchscientist
Author

I am working on a Windows 10 computer with Virtualbox 7.08, and a virtual Ubuntu 22.04 Linux computer with Conda (Mamba) environments. So it is Linux, not emulating.

I had a look at https://anaconda.org/bioconda/blast/files, and the BLAST file size is OK there? Blastn is about 270 kb.

@vbrover
Contributor

vbrover commented Apr 29, 2023

I will pass this issue to those who understand blast better on Monday.

@dutchscientist
Author

And it works absolutely fine if I leave out -O Salmonella; the only thing I am missing then is the point mutation resistances. But I can do those with pointfinder.

Thanks for your help!

@evolarjun
Contributor

I thought it might be due to the blast in bioconda, which applies several minor patches to the blast source, but I wasn't able to reproduce the issue with two versions of blast from bioconda. Still a mystery to me.

@vbrover
Contributor

vbrover commented May 6, 2023

You can download NCBI BLAST from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ and use the amrfinder option --blast_bin BLAST_DIR.
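
As an illustration of the suggestion above, here is a minimal sketch of pointing amrfinder at a separately downloaded BLAST. The wrapper name run_amrfinder, the output file naming, and the example paths are hypothetical. The -O Salmonella, --plus, and --blast_bin options come from this thread; -n (nucleotide FASTA input) and -o (output file) are amrfinder options as I understand them, so check amrfinder --help for your version.

```shell
# Hypothetical wrapper: run amrfinder on one assembly using a chosen BLAST install.
# $1 = nucleotide FASTA file, $2 = directory holding the BLAST executables
#      (e.g. the bin/ directory of the unpacked NCBI BLAST+ download).
run_amrfinder() {
    fasta=$1
    blast_dir=$2
    # The output name is an arbitrary choice for this sketch: Salm000048.fna -> Salm000048.amrfinder.tsv
    amrfinder -n "$fasta" -O Salmonella --plus \
        --blast_bin "$blast_dir" \
        -o "${fasta%.fna}.amrfinder.tsv"
}

# Example call (placeholder path, adjust to where BLAST+ was unpacked):
# run_amrfinder Salm000048.fna "$HOME/ncbi-blast/bin"
```

Keeping the BLAST directory as an argument makes it easy to run the same sample against the Conda-installed and the NCBI-distributed binaries and compare the outcomes.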

@dutchscientist
Author

Thanks! I did notice that each time I re-ran the failed ones, a few would suddenly work (say 5% of the samples), and then I moved to another virtual computer and all the remaining samples worked fine. I am about to run another big batch, so I will try this and report back.

@dutchscientist
Author

dutchscientist commented May 9, 2023

I have reformatted the headers and files with SeqFu (https://github.com/telatin/seqfu2):
fu-multirelabel -r genomename -n genome-00001.fna --no-comments > genome00001.fna
(I previously used Prokka-generated FASTA files)

I still use BLAST+ 2.13.0, but now I have not had dropouts anymore, except for 1 genome that ran fine when done again. All the previous "problem makers" like Salm000048.fna ran absolutely fine.

The only difference I can see between the Prokka- and SeqFu-generated files is that Prokka writes 60 bases per line (like GenBank downloads), whereas SeqFu puts each contig on a single line with no line breaks.

Anyway, just ran 40k of the 50k Salmonella genomes without a hiccup (still running), so problem seems to have been resolved. Happy to close it, thanks for the assistance!
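
For anyone who wants to test the line-length observation without SeqFu, here is a minimal sketch that joins wrapped sequence lines so each contig ends up on one line. It is not what fu-multirelabel does internally (it does not rename headers), the function name unwrap_fasta and the demo contig names are made up, and whether line wrapping is really what triggers the crash was not confirmed in this thread.

```shell
# Rewrite a FASTA file so that every sequence sits on a single line.
# Header lines (starting with ">") are passed through unchanged.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }  # flush previous sequence, print header
             { seq = seq $0 }                                     # append a wrapped sequence line
        END  { if (seq != "") print seq }                         # flush the last sequence
    ' "$1"
}

# Small demonstration on a throwaway file
tmp=$(mktemp)
printf '>contig_1\nACGT\nTTGA\n>contig_2\nGG\nCC\n' > "$tmp"
unwrap_fasta "$tmp"
rm -f "$tmp"
```

On a real assembly the output would be redirected to a new file, for example unwrap_fasta wrapped.fna > unwrapped.fna, and the new file passed to amrfinder.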

@evolarjun
Contributor

I'm glad you got it working!

Thanks for the clue about line length, and thanks for your patience. We'll take a look and see if we can figure anything out. At least we have a potential fix if we hear of other people having the issue and a clue as to what could be breaking.

Thanks again for reporting and giving us all the details.


3 participants