fasterq-dump job hung and killed multiple servers #463

Closed
SilasK opened this issue Feb 1, 2021 · 28 comments
@SilasK

SilasK commented Feb 1, 2021

Dear NCBI developers,

I wanted to use the new fasterq-dump tool to download 1000 FASTQ files. As recommended, I used prefetch to get the .sra files. Then I tried to extract the FASTQ files with fasterq-dump. To speed up the process, I submitted multiple jobs on a Slurm cluster.

However, many of the jobs hung, and several cluster nodes had to be rebooted. Do you have any idea why this could happen?

Here is my script:

cd SRAreads
ls ERR1293886/ERR1293886.sra
fasterq-dump --threads 4 --temp /tmp/fasterqdum/ --log-level info --progress --print-read-nr ERR1293886 

Here is a log:

2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: path not found while opening node within configuration module - no image guid
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: path not found while opening node within configuration module - no image guid
join   :2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting guard size 1038336, default was 4096
|  0.00%2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting stack size 16777216, default was 2097152
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting guard size 1038336, default was 4096
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting stack size 16777216, default was 2097152
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting guard size 1038336, default was 4096
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting stack size 16777216, default was 2097152
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting guard size 1038336, default was 4096
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting stack size 16777216, default was 2097152
2021-01-28T06:12:03 fasterq-dump.2.10.9 debug: requesting guard size 1038336, default was 4096

Slurm log:

slurmstepd: error: *** JOB 42698933 ON node242 CANCELLED AT 2021-01-27T23:42:58 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 42698933 STEPD TERMINATED ON node242 AT 2021-01-27T23:44:58 DUE TO JOB NOT ENDING WITH SIGNALS ***

The system administrator explained the problem to me:

From a Linux point of view, your process was in D state ("Waiting in uninterruptible disk sleep", cf. man 5 proc), which is the worst state you can have, since the only "solution" to get rid of it is to reboot. Moreover, a simple ps a hung as well, so no administrative tasks could be performed.

Here is the relevant Slurm part, where you can see the errors, since none of the processes associated with the job could be killed.

@SilasK
Author

SilasK commented Feb 1, 2021

I am now using https://github.com/rvalieris/parallel-fastq-dump, which works like a charm.

@SilasK
Copy link
Author

SilasK commented Feb 1, 2021

I had an idea: could it be that fasterq-dump creates non-unique temp files that get overwritten when multiple processes write to the same files?

@wraetz
Contributor

wraetz commented Feb 1, 2021

fasterq-dump creates temporary files named from a tuple of machine name, user name, and pid.
We went to some lengths to avoid the problem of multiple processes overwriting each other's files.
What happens if you run many fastq/fasterq-dump instances in parallel is that you put a lot of load on our download servers.
What parallel-fastq-dump does is the same procedure we are recommending: first prefetch the accession, then fastq/fasterq-dump the local copy.

In the case of parallel-fastq-dump: the tool downloads the accession first, then runs multiple instances of fastq-dump in parallel on the local copy, each instance on a slice of the full row range. Then it concatenates the pieces.

That is exactly what fasterq-dump is doing ( minus the prefetch ), in C instead of Python. For unaligned accessions this will result in about the same speed. For aligned accessions, however, fasterq-dump will be faster, because there is an inner join to be made to assemble the FASTQ output. Fasterq-dump avoids this inner join by creating a lookup table on disk as the first step. Because of that, fasterq-dump cannot be run on a slice of rows - each slice would require a full table scan and erase the speed advantage of the whole approach.

What I would recommend for best speed on multiple servers: a combination of prefetch and fasterq-dump.
But do not download all the accessions you need with prefetch in parallel - our servers will see requests coming from the same IP address as a DOS attempt and throttle you. You should download the same way a download manager would: if you have 1000 accessions to process, download maybe 5 of them in parallel. Be aware that prefetch is allowed to fail. Check the return code of prefetch and repeat the download if it is not zero. Prefetch will not restart the whole download; it will pick up at the position where it failed. As soon as you have some accessions locally available, call fasterq-dump for them on one of your machines. Prefetch will create a directory for each downloaded accession. Specify this directory as the argument for fasterq-dump, not just the accession or the .sra file inside it.

Assuming you have downloaded an accession into a directory '/path/to/where/it/is/stored', do this:
cd /path/to/where/it/is/stored
fasterq-dump SRR1234567
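
To tie the above together, here is a minimal sketch of that workflow in Bash. The file name accessions.txt, the sra/ and fastq/ directory layout, the retry interval, and the choice of 5 parallel downloads are illustrative assumptions, not official recommendations; it also assumes prefetch's -O/--output-directory option for placing downloads under a common directory:

#!/bin/bash
# Sketch only: accessions.txt lists one accession per line.
mkdir -p sra fastq

fetch_one() {
    local acc="$1"
    # prefetch may fail; check its return code and retry -
    # it resumes where it stopped instead of restarting the download.
    until prefetch "$acc" --output-directory sra; do
        echo "prefetch $acc failed, retrying" >&2
        sleep 60
    done
}
export -f fetch_one

# Download at most 5 accessions in parallel, as recommended above.
xargs -P 5 -I{} bash -c 'fetch_one "$@"' _ {} < accessions.txt

# Convert each accession, passing the directory created by prefetch.
for dir in sra/*/; do
    fasterq-dump --threads 4 --outdir fastq "$dir"
done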

More details can be found here:

https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
and
https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump

If you have the luxury of a RAM disk (aka /dev/shm) that is big enough, make use of it as scratch space - it is worth it speed-wise!
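
For example (a sketch with a placeholder accession), that just means pointing the existing --temp option at the RAM disk:

fasterq-dump --temp /dev/shm SRR1234567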

@SilasK
Author

SilasK commented Feb 1, 2021

Sorry if I wasn't clear. I used prefetch before.

@klymenko
Contributor

klymenko commented Feb 9, 2021

@SilasK, do you still need help?

@SilasK
Author

SilasK commented Feb 9, 2021 via email

@wraetz
Contributor

wraetz commented Feb 9, 2021

Regarding how fasterq-dump hung and killed your servers (Slurm cluster):

  • how many parallel jobs did you run?
  • what accessions did you try to convert?
  • do the nodes of your cluster have internet-access?
  • does the user submitting the job have a synchronized home-directory on the cluster?
  • can you please send me your sbatch script?

I have created a small test-cluster with slurm - and I was not able to kill it with fasterq-dump....
I am interested in replicating the situation...

@SilasK
Author

SilasK commented Feb 9, 2021

  • how many parallel jobs did you run?
    < 50
  • what accessions did you try to convert?
    In total 1000, see below
  • do the nodes of your cluster have internet-access?
    yes 1GB bandwidth
  • does the user submitting the job have a synchronized home-directory on the cluster?
    Yes
  • can you please send me your sbatch script?
    I used the following Snakemake file:

https://github.com/metagenome-atlas/atlas/blob/1e6994c694476b47ad080013c85979d1f00906fb/atlas/rules/sra.smk

It's part of a bigger pipeline, but if you have conda installed, I can help you reproduce the error with ~5 commands.

See the accessions below:

ERR414334 ERR414343 ERR414344 ERR414345 ERR414348 ERR414352 ERR414359 ERR414360 ERR414442 ERR504975 ERR505104 ERR525695 ERR525702 ERR525718 ERR525746 ERR525770 ERR525773 ERR525777 ERR525801 ERR525817 ERR525824 ERR525825 ERR525843 ERR525844 ERR525870 ERR525873 ERR525886 ERR525899 ERR525905 ERR525915 ERR525924 ERR525926 ERR525936 ERR525940 ERR525944 ERR525971 ERR525991 ERR526008 ERR526023 ERR526032 ERR526084 ERR589442 ERR589443 ERR589552 ERR589857 ERR688508 ERR688509 ERR688519 ERR688535 ERR688537 ERR688557 ERR688615 ERR688643 ERR866585 ERR866597 ERR911962 ERR970243 ERR970274 ERR970276 ERR970315 ERR970318 ERR973063 ERR973066 ERR973070 ERR973084 ERR973104 ERR973109 SRR1046171 SRR1046172 SRR1046199 SRR1046204 SRR1046822 SRR1196447 SRR1196449 SRR1196451 SRR1196456 SRR1196578 SRR1196591 SRR1196613 SRR1196637 SRR1197017 SRR1197019 SRR1197025 SRR1197032 SRR1197035 SRR1197043 SRR1197050 SRR1197054 SRR1197062 SRR1197072 SRR1197083 SRR1197086 SRR1197515 SRR1197519 SRR1197531 SRR1197540 SRR1197584 SRR1197587 SRR1197619 SRR1197625 SRR1197703 SRR1197722 SRR1201481 SRR1201555 SRR1211416 SRR1211496 SRR1212170 SRR1212192 SRR1212273 SRR1212291 SRR1212726 SRR1214490 SRR1214522 SRR1214535 SRR1214724 SRR1214867 SRR1214948 SRR1214975 SRR1215333 SRR1215570 SRR1215631 SRR1215926 SRR1215951 SRR1215963 SRR1219365 SRR1219503 SRR1219606 SRR1437790 SRR1437803 SRR1437965 SRR1437994 SRR1438037 SRR1438054 SRR1449335 SRR1449418 SRR1449419 SRR1449517 SRR1449536 SRR1449622 SRR1449744 SRR1449782 SRR1461770 SRR1461774 SRR1461786 SRR1461828 SRR1461927 SRR1462016 SRR1462033 SRR1462089 SRR1462453 SRR1462488 SRR1462493 SRR1462559 SRR1462561 SRR1462585 SRR1462616 SRR1462660 SRR1462693 SRR1462738 SRR1462750 SRR1462780 SRR1462784 SRR1462806 SRR1482518 SRR1482556 SRR1483255 SRR1483404 SRR1490836 SRR1490879 SRR1490880 SRR1490894 SRR1490900 SRR1490905 SRR1490908 SRR1490919 SRR1490940 SRR1490967 SRR1490973 SRR1490979 SRR1490981 SRR1490994 SRR1491010 SRR1491076 SRR1491095 SRR1491146 SRR1491185 SRR1491249 SRR1491378 SRR1491412 SRR1491443 SRR1491456 SRR1491466 SRR1491542 SRR1491566 SRR1491600 SRR1491614 SRR1491638 SRR1491640 SRR1491681 SRR1491703 SRR1491746 SRR1491749 SRR1492018 SRR1492052 SRR1492298 SRR1492320 SRR1492332 SRR1492500 SRR1493030 SRR1501076 SRR1757110 SRR1757176 SRR1757263 SRR1761671 SRR1761677 SRR1761699 SRR1761711 SRR1761717 SRR1765179 SRR1765190 SRR1765194 SRR1765277 SRR1765294 SRR1765295 SRR1765332 SRR1765354 SRR1765367 SRR1765371 SRR1765375 SRR1765386 SRR1765419 SRR1765430 SRR1765456 SRR1765471 SRR1765476 SRR1765487 SRR1765516 SRR1765572 SRR1765603 SRR1765633 SRR1765646 SRR1779103 SRR1779118 SRR1779131 SRR1779134 SRR1779139 SRR1793377 SRR1793410 SRR1909481 SRR1909520 SRR1909590 SRR1910592 SRR1910668 SRR1910683 SRR1915588 SRR1915633 SRR1915657 SRR1915664 SRR1915679 SRR1915769 SRR1915770 SRR1915782 SRR1915838 SRR1915869 SRR1915904 SRR1915996 SRR1916008 SRR1916030 SRR1916034 SRR1916043 SRR1916443 SRR1916453 SRR1916844 SRR1918830 SRR1918927 SRR1931173 SRR1931177 SRR1931178 SRR1980153 SRR1980306 SRR1980317 SRR1980318 SRR1980325 SRR1980337 SRR1980340 SRR1980349 SRR1980382 SRR1980390 SRR1980404 SRR1980433 SRR1980437 SRR1980445 SRR1980465 SRR1980468 SRR1980494 SRR1980504 SRR203317 SRR2047840 SRR2155179 SRR2155180 SRR2155309 SRR2155311 SRR2155315 SRR2155326 SRR2155347 SRR2155349 SRR2155372 SRR2155395 SRR2674234 SRR2726442 SRR2857886 SRR2912778 SRR2912781 SRR2912804 SRR2992906 SRR2992918 SRR2992921 SRR2992922 SRR2992943 SRR2992958 SRR2992959 SRR3108079 SRR3110879 SRR3131712 SRR3131728 SRR3131750 SRR3131788 SRR3131792 SRR3131846 
SRR3131937 SRR3131961 SRR3131983 SRR3132001 SRR3132009 SRR3132015 SRR3132017 SRR3132027 SRR3132059 SRR3132097 SRR3132135 SRR3132203 SRR3132209 SRR3132231 SRR3132277 SRR3132301 SRR3132337 SRR3132353 SRR3132355 SRR3132373 SRR3132387 SRR3132429 SRR3132451 SRR3160441 SRR3160444 SRR3160450 SRR341587 SRR341588 SRR341600 SRR341650 SRR341669 SRR341684 SRR341715 SRR341725 SRR3546780 SRR3582135 SRR3582162 SRR3726350 SRR3726360 SRR3726374 SRR3726375 SRR3737017 SRR3737024 SRR3737029 SRR3737030 SRR3917692 SRR3992964 SRR3993009 SRR4033065 SRR4033067 SRR4305044 SRR4305046 SRR4305053 SRR4305060 SRR4305067 SRR4305071 SRR4305074 SRR4305076 SRR4305080 SRR4305095 SRR4305110 SRR4305128 SRR4305142 SRR4305158 SRR4305166 SRR4305176 SRR4305188 SRR4305194 SRR4305205 SRR4305208 SRR4305210 SRR4305225 SRR4305227 SRR4305230 SRR4305257 SRR4305258 SRR4305264 SRR4305276 SRR4305317 SRR4305318 SRR4305324 SRR4305332 SRR4305355 SRR4305361 SRR4305401 SRR4305406 SRR4305407 SRR4305417 SRR4305426 SRR4305427 SRR4305429 SRR4305433 SRR4305440 SRR4305520 SRR4305523 SRR4408009 SRR4408012 SRR4408018 SRR4408019 SRR4408028 SRR4408033 SRR4408037 SRR4408067 SRR4408079 SRR4408091 SRR4408098 SRR4408101 SRR4408109 SRR4408126 SRR4408127 SRR4408160 SRR4408208 SRR4408209 SRR4408228 SRR4408252 SRR4423586 SRR4423591 SRR4423608 SRR4423612 SRR4423615 SRR4423633 SRR4423651 SRR4423667 SRR4423686 SRR4423690 SRR4435689 SRR4435706 SRR4435720 SRR4435726 SRR4435750 SRR4435775 SRR4435780 SRR4435786 SRR4435792 SRR4435795 SRR4435809 SRR4444744 SRR4444752 SRR4444757 SRR4444811 SRR4444836 SRR4444867 SRR4451535 SRR4451550 SRR4451574 SRR4451581 SRR4451595 SRR4451601 SRR4451606 SRR4451661 SRR4481685 SRR4481714 SRR4481735 SRR4481745 SRR4481747 SRR4481752 SRR4481758 SRR4481761 SRR4481764 SRR4481771 SRR4481772 SRR4481793 SRR4481809 SRR4783386 SRR4783392 SRR4783406 SRR4783487 SRR4783494 SRR4783496 SRR4783505 SRR4783508 SRR4783512 SRR4783513 SRR4783527 SRR4783567 SRR4783573 SRR4783597 SRR4783609 SRR4783627 SRR492182 SRR5032274 SRR5032307 SRR5032313 SRR5032314 SRR5032317 SRR5032321 SRR5032328 SRR5056660 SRR5056684 SRR5056690 SRR5056691 SRR5056693 SRR5056716 SRR5056723 SRR5056729 SRR5056739 SRR5056764 SRR5056795 SRR5056802 SRR5056810 SRR5056870 SRR5056900 SRR5056958 SRR5056969 SRR5056972 SRR5056988 SRR5057055 SRR5057063 SRR5057065 SRR5057069 SRR5057076 SRR5057091 SRR5057108 SRR5057127 SRR5058923 SRR5091463 SRR5091520 SRR5106272 SRR5106277 SRR5106287 SRR5106301 SRR5106307 SRR5106318 SRR5106332 SRR5106402 SRR5106430 SRR5106465 SRR5127412 SRR5127465 SRR5127479 SRR5127490 SRR5127501 SRR5127509 SRR5127513 SRR5127520 SRR5127558 SRR5127584 SRR5127601 SRR5127649 SRR5127651 SRR5127669 SRR5127676 SRR5127690 SRR5127719 SRR5127742 SRR5127800 SRR5127805 SRR5127833 SRR5127849 SRR5127851 SRR5127857 SRR5275408 SRR5275414 SRR5275437 SRR5275452 SRR5275460 SRR5275464 SRR5275474 SRR5275478 SRR5275481 SRR5275484 SRR5279220 SRR5279249 SRR5279259 SRR5279262 SRR5279275 SRR5279276 SRR5279281 SRR5279292 SRR5279298 SRR5279304 SRR5558036 SRR5558037 SRR5558047 SRR5558060 SRR5558063 SRR5558077 SRR5558078 SRR5558084 SRR5558095 SRR5558105 SRR5558113 SRR5558123 SRR5558138 SRR5558150 SRR5558158 SRR5558161 SRR5558200 SRR5558201 SRR5558204 SRR5558216 SRR5558241 SRR5558253 SRR5558257 SRR5558266 SRR5558281 SRR5558282 SRR5558296 SRR5558302 SRR5558325 SRR5558337 SRR5558342 SRR5558377 SRR5558379 SRR5558395 SRR5558402 SRR5558408 SRR5579976 SRR5579988 SRR5580016 SRR5580046 SRR5580051 SRR5580062 SRR5580063 SRR5580085 SRR5580109 SRR585726 SRR585783 SRR5963134 SRR5963145 SRR5963156 SRR5963178 SRR5963179 SRR5963190 
SRR5963192 SRR5963197 SRR5963209 SRR5963226 SRR5963240 SRR5963286 SRR5963295 SRR5963304 SRR5963306 SRR5963315 SRR5963324 SRR5963346 SRR5963357 SRR6028180 SRR6028192 SRR6028205 SRR6028206 SRR6028217 SRR6028219 SRR6028258 SRR6028264 SRR6028269 SRR6028278 SRR6028289 SRR6028300 SRR6028303 SRR6028305 SRR6028346 SRR6028378 SRR6028398 SRR6028401 SRR6028402 SRR6028431 SRR6028439 SRR6028484 SRR6028486 SRR6028521 SRR6028531 SRR6028538 SRR6028540 SRR6028554 SRR6028576 SRR6028596 SRR6028602 SRR6028603 SRR6028604 SRR6028621 SRR6028657 SRR6038266 SRR6038299 SRR6038355 SRR6038482 SRR6038529 SRR6054420 SRR6054447 SRR6054451 SRR6054484 SRR6257380 SRR6257386 SRR6257395 SRR6257410 SRR6257433 SRR6257435 SRR6257461 SRR6257473 SRR6257474 SRR6257486 SRR769516 SRR769529 SRR828661 SRR918060 SRR924754 SRR924777 SRR924791 SRR935340 SRR935356

@HenrikBengtsson

We're experiencing the same problem, where a user who calls fasterq-dump can take down our CentOS 7 machines (it has happened every other month for a year now). When this happens, commands like w and ps hang the console. sudo ls -l /proc/* | grep faster still works, which is why we haven't been able to narrow it down to fasterq-dump each time. This happens both when users install their own version via Conda and when using the fasterq.2.10.9-centos_linux64 build we installed.

Right now we're looking into mitigating this problem by forcing fasterq-dump to run single-threaded. We understand it's suboptimal, but the alternative, crashing hosts, is worse. We also can't remove the tool, because then users would go and install their own version.

AFAIU, there is no way to control the number of threads that fasterq-dump uses, other than via the CLI option:

  -e|--threads <count>             how many threads to use (dflt=6)

Ask/wish: Could you please consider adding support for overriding the default six (6) threads via an environment variable, e.g. FASTERQ_DUMP_THREADS=1? That way we would be able to avoid crashes by default. Crashes could still happen if a user specified --threads <count>.

Question: Another workaround we're considering is defining a wrapper Bash function fasterq-dump() that calls the fasterq-dump executable with an explicit --threads 1 before passing on the other CLI arguments. Do I need to drop the user's --threads n, or will fasterq-dump --threads 1 --threads 10 ... end up using a single core and ignore the 10, or is it vice versa?
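
(For illustration only: if dropping the user's option turns out to be necessary, a wrapper could filter out -e/--threads before injecting its own. This is an untested sketch and does not handle a possible --threads=N form:)

fasterq-dump() {
    local args=() skip=0 arg
    for arg in "$@"; do
        # skip the value that followed a dropped -e/--threads
        if (( skip )); then skip=0; continue; fi
        case "$arg" in
            -e|--threads) skip=1 ;;   # drop the option itself
            *) args+=("$arg") ;;
        esac
    done
    command fasterq-dump --threads 1 "${args[@]}"
}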

Thxs

@erhoppe

erhoppe commented May 19, 2021

Fasterq-dump has also necessitated reboots of our servers, and strangely of the home-directory server, despite my script writing the temporary files and outputs to directories on other servers (I verified that they are indeed being written there) and running on other nodes. I'm not the admin, so I don't know the specifics (they said that "ps -ef" hanging was the most common issue), but it's very odd behavior. I've been using fasterq-dump because the ability to use multiple threads (typically 12-18) makes it substantially faster at downloads than prefetch in my testing. However, the sysadmins are now disabling fasterq-dump until a solution can be figured out.

@HenrikBengtsson

@erhoppe, thanks for sharing. Do you know if you also have a BeeGFS file system (as others with this problem have reported elsewhere on this issue tracker)? If you don't know, try:

$ beegfs-ctl | head -1
BeeGFS Command-Line Control Tool (http://www.beegfs.com)

or see if you have some other CLI BeeGFS tools, e.g. beegfs-df and beegfs-net.

@erhoppe

erhoppe commented May 19, 2021

@HenrikBengtsson, none of those commands were found, but I did find one 2019 reference to BeeGFS in the wiki regarding the file systems I am using with fasterq-dump (scratch/temporary storage), so... possibly. I'm uncertain whether they're also using BeeGFS for the other file systems (e.g. the home directory, which is located elsewhere) since those aren't referenced, but I will ask.

@erhoppe

erhoppe commented May 19, 2021

Update: BeeGFS is indeed what they're using for the scratch storage. According to our sysadmins, BeeGFS has found the bug and is working on a fix! (Apparently, other programs have caused the issue as well, though fasterq-dump is the most common.)

I expect this issue can be closed since it's not fasterq-dump's fault, though it would probably be useful to add the functionality to override the thread count (as a user, it would be nice to pair that with a notification of this behavior, since I would probably have been quite alarmed to see it running on a single core).

@timeu

timeu commented Oct 13, 2021

@erhoppe: Do you maybe have a link to the BeeGFS bug?

@erhoppe

erhoppe commented Oct 13, 2021

@timeu , I'm afraid not. I believe they were corresponding with them through email, and I also haven't checked in with my sysadmins in a while since I've been working on other things. If you're having the same issue, let me know and I can reach out to them to see if they've heard anything.

@timeu

timeu commented Oct 13, 2021

@erhoppe: Thanks for the quick response. Yes, we are seeing the exact same issue.
A single user managed to drain almost 50 nodes as soon as the job landed on the node.
It would be great if you could reach out and see if they have heard anything.
Thanks!

@durbrow
Collaborator

durbrow commented Oct 13, 2021

though as a user, it would be nice to pair it with a notification of this behavior since I probably would have been quite alarmed to see it running on a single core

The file system is assumed to be POSIX compliant for all Unix and Unix-like hosts. There isn't any general means for user programs to detect that a file system has bugs that it can't work with, particularly if the nature of the bug is to hang the process or the host.

@HenrikBengtsson

HenrikBengtsson commented Oct 13, 2021

Hi (again).

Ad-hoc workaround attempt

Our current approach to avoiding this problem is to declare a Bash function fasterq-dump globally that injects --threads and --temp options:

fasterq-dump() { 
    command fasterq-dump --threads 2 --temp "$(mktemp -d)" "$@"
}

The hope is that by running with a minimal number of threads (= two) and writing to $TMPDIR instead of the current working directory (often BeeGFS), we significantly lower the risk of hitting the BeeGFS stall/crash/failure. Tests have shown that it is not sufficient to set only --temp; the number of parallel threads appears to have an effect too, and the higher it is, the more likely it is that the node failure occurs.

The above Bash function obviously only works in Bash. Sysadms need to do similar things for other shells they support on their systems.

Caveat?

It is not clear to me what settings the fasterq-dump executable will use if the user calls:

fasterq-dump --threads 4 --temp /path/to some thing 

which then end up calling

fasterq-dump --threads 2 --temp "$(mktemp -d)" --threads 4 --temp /path/to some thing 

Will the fasterq-dump executable handle this, and will it use the first or the second instance of the options? I hope the second, so that end users can override our workaround.

Wish

Instead of the above hack, I wish it would be possible to control the defaults via environment variables. For example, setting:

export FASTERQ_DUMP_THREADS=2
export FASTERQ_DUMP_TEMP=$(mktemp -d)

globally would achieve the same as the above Bash function, and it would do so regardless of shell.
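
(Until such support exists, a site could approximate it with a wrapper that reads those variables itself - a sketch only, using the hypothetical variable names proposed above; user-supplied options still come last, with the same open question about duplicate options as in the earlier wrapper:)

fasterq-dump() {
    command fasterq-dump \
        --threads "${FASTERQ_DUMP_THREADS:-6}" \
        --temp "${FASTERQ_DUMP_TEMP:-$(mktemp -d)}" \
        "$@"
}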

A related wish is that fasterq-dump would support single-threaded processing. Currently, if you use --threads 1 it is silently ignored and the tool falls back to the default of six threads (--threads 6), cf. #494. If --threads 1 will not be supported, I argue it should produce an informative error message instead of silently ignoring the option.

Motivation

We know compute hosts that run BeeGFS go down all the time around the world because of users running fasterq-dump. One by one, the sysadms of these systems figure out that it has to do with fasterq-dump. They keep rebooting machines and they keep trying to tell users to use fastq-dump instead of fasterq-dump, because the latter crashes the machines. Of course, users don't read all the documentation and announcements, new users who don't know will always arrive, and they can single-handedly bring down compute hosts. Then there are all the sysadms who still haven't figured out what causes their compute hosts to crash; they are either pulling their hair out now, or they will be in the future. There's a lot(!) of compute and human resources wasted here.

Even if it is a BeeGFS bug (has that actually been confirmed and reported?) that will eventually be fixed in BeeGFS, upgrading parallel file systems is a very, very slow process. Because of this, it will take many years before all compute environments have upgraded to a fixed BeeGFS. In other words, we're going to live with this problem for a long time.

Even if it technically is not a problem of fasterq-dump per se, AFAIK, it is the only software tool we have seen having this severe impact on our 12,000+ core environment with 1,000+ heterogeneous users (plus another 2,000+ core environment with 100+ users). Because of this, at a minimum, it would be extremely helpful if the maintainers of sratoolkit and fasterq-dump would consider adding means for sysadms to mitigate this problem. 🙏 My suggestion is to be able to control the defaults via environment variables.

@durbrow
Collaborator

durbrow commented Oct 13, 2021

With the latest version of fasterq-dump (2.11.2), -t|--temp sets the location for all temporary files that it might generate. See CHANGES

There are means to specify configuration for the SRA Toolkit, both per user and for an installation. The documented means for users to edit their configuration is to use vdb-config -i. This updates their user configuration. For site-wide configuration, please use the contact information in the README for help setting it up.

@HenrikBengtsson

HenrikBengtsson commented Oct 13, 2021

With the latest version of fasterq-dump (2.11.2), -t|--temp sets the location for all temporary files that it might generate. See CHANGES

That's brilliant. I didn't know that this was not already the case. So, if I understand correctly, the changelog entry "fasterq-dump: option -t sets directory of all temp files (including VDB cache)" means that in sratoolkit (< 2.11.2) the current working directory could still be hit by lots of temporary file I/O even if you directed -t/--temp to another drive.

If so, it could be (= I hope) that with sratoolkit (>= 2.11.2) there's no longer a need to limit the number of threads as a workaround. We'll try to re-run the tests where we could reproducibly crash compute hosts on BeeGFS. Hopefully, it'll work with the default --threads from now on.

There are means to specify configuration for the SRA Toolkit, both per user and for an installation. The documented means for users to edit their configuration is to use vdb-config -i. This updates their user configuration.

Unfortunately, from a sysadm point of view, asking users to edit their own config will only reach so many users.

For site-wide configuration, please use the contact information in the README for help setting it up.

Now, this is interesting. Before I go down that path, will this also be possible when users install their own version of sratoolkit, e.g. via Anaconda (which is unfortunately quite common, despite us already providing it via environment modules)?

A follow-up question for my understanding as a fellow developer: What's the reason for --temp not defaulting to a temp folder provided by the system (e.g. mktemp -d), which is often set up to provide maximum performance? Is it that temp folders are often on /tmp (unless TMPDIR says otherwise), and /tmp is often too small to host fasterq-dump's temporary files?

@erhoppe

erhoppe commented Oct 13, 2021

I heard back from my sysadmin: they and BeeGFS were unable to reproduce the error, so it seems unlikely that we'll get a fix from that end. From what they've seen, the issue happens when writing to any BeeGFS file system. They've only been successful when writing to local storage or when using fastq-dump. They've limited fasterq-dump to a single thread, so they haven't had more data on what's causing it.

@wraetz
Contributor

wraetz commented Oct 14, 2021

Before release 2.11.2 there were two different temp locations involved for fasterq-dump:
  • The location for temp files the tool creates (-t|--temp), which defaults to the current working directory.
  • The location for temp files the underlying library uses to cache remote HTTPS requests, which defaults to '/var/tmp' and can be changed via 'vdb-config -i' (or can be avoided completely by using prefetch and supplying fasterq-dump with the absolute path of the directory created by prefetch).

Now, with release 2.11.2, fasterq-dump forces the underlying library to also use the location given via -t|--temp. If the temp location is not given, the previous two default locations are still in use (current working directory and '/var/tmp').
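
(As a concrete illustration of the above - the paths and accession are placeholders, and the existence of a local scratch area is an assumption about your site: prefetching first and pointing -t|--temp at fast local storage keeps both kinds of temporary I/O off the shared file system:)

# Download once; prefetch creates a directory named after the accession.
prefetch SRR1234567 --output-directory /local/scratch/sra

# Give fasterq-dump the prefetch directory (so no remote HTTPS caching is
# needed) and keep its own temp files on local storage via -t|--temp.
fasterq-dump --temp /dev/shm --outdir /local/scratch/fastq \
    /local/scratch/sra/SRR1234567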

In hindsight it might have been better to default the -t|--temp location to the system-provided location - but we did not change it later, because we did not want to surprise current users.

@HenrikBengtsson

HenrikBengtsson commented Oct 16, 2021

Thanks for clarifying - it's helpful information for sysadms who are trying to mitigate this problem - folks who probably are not familiar with this tool at all.

In hindsight it might have been better to default the -t|--temp location to the system-provided location - but we did not change it later, because we did not want to surprise current users.

I might come back and nudge you about this one later ;)

@liuxiawei

I face the same problem with both version 2.11.0 and the latest version, 3.0.0.

@durbrow durbrow closed this as completed Nov 14, 2022
@tjakobi

tjakobi commented Dec 7, 2022

I am also still facing the described issues on a BeeGFS file system, using both 2.11.0 and 3.0.0. While the system does not always enter the described "ps hangs" state, it does so after a few successful runs of fasterq-dump.

Using fasterq-dump --threads 2 --outdir /beegfs/some/dir/ -t /dev/shm/ --split-files does result in lock-ups, although specifying a non-BeeGFS output directory seems to do the trick.

However, my cluster environment uses BeeGFS as its main storage, so there are no large local disk capacities to use; moreover, users are still able to kill nodes if they specify the wrong directories.

Is there any update on this?

Thank you!

@HenrikBengtsson

@tjakobi , just in case someone (e.g. someone from sra-tools or BeeGFS) finds this later on and tries to troubleshoot it, what version of BeeGFS are you running? (I'm just trying to maximize the chances for this critical problem to be fixed)

@tjakobi

tjakobi commented Dec 7, 2022

Hi @HenrikBengtsson, of course, good point. I'm running BeeGFS 7.3.1 on Ubuntu 20.04.5 LTS (with 5.4.0-131-generic Kernel).

@HenrikBengtsson

Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:

  • Fixed a deadlock between mmap and read when both were operating on the same memory area. This should fix issues with multithreaded runs of fasterq-dump that have been reported by some users.
