fasterq-dump job hanged and killed multiple server #463
Comments
I am now using https://github.com/rvalieris/parallel-fastq-dump, which works like a charm.
I had an idea: could it be that fasterq-dump creates non-unique temp files that might be overwritten by multiple processes writing to the same files?
fasterq-dump does create temporary files based on a tuple of machine name, user name and pid.

In the case of parallel-fastq-dump: the tool downloads the accession first, then runs multiple instances of fastq-dump in parallel on the local copy, each instance with a slice of the full row range. Then it concatenates the pieces. That is exactly what fasterq-dump is doing (minus the prefetch), in C instead of Python. For unaligned accessions this will result in about the same speed. For aligned accessions, however, fasterq-dump will be faster, because there is an inner join to be made to assemble the FASTQ output. Fasterq-dump avoids this inner join by creating a lookup table on disk as the first step. Because of that, fasterq-dump cannot be run on a slice of rows - each slice would result in a full table scan and erase the speed advantage of the whole approach.

What I would recommend for best speed on multiple servers: a combination of prefetch and fasterq-dump. Assuming you have downloaded an accession into a directory '/path/to/where/it/is/stored', run fasterq-dump on that local copy (a sketch follows below). More details are to be found here: https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump

If you have the luxury of a RAM-disk (aka /dev/shm) that is big enough, make use of it as scratch space - it is worth it speed-wise!
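A minimal sketch of that prefetch + fasterq-dump combination (not from the original comment; the accession, the output paths and the /dev/shm scratch location are placeholders/assumptions):

```sh
# 1. Download the accession once (network-bound step).
prefetch SRR000001 --output-directory /path/to/where/it/is/stored

# 2. Convert the local copy to FASTQ; if a large enough RAM-disk exists,
#    point --temp at it for the scratch files.
fasterq-dump /path/to/where/it/is/stored/SRR000001 \
    --threads 6 \
    --temp /dev/shm \
    --outdir /path/to/fastq
```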
Sorry if I wasn't clear. I used prefetch before.
@SilasK, do you still need help?
No, parallel-fastq-dump solved my problems. If you have any questions about how fasterq-dump killed our server, I can forward your questions.
Regarding how fasterq-dump killed your server (Slurm cluster):
I have created a small test cluster with Slurm, and I was not able to kill it with fasterq-dump...
It's part of a bigger pipeline, but if you have conda installed, I can help you reproduce the error with ~5 commands. See indexes below:
We're experiencing the same problem, where a user who calls fasterq-dump can take down a compute node. Right now we're looking into mitigating this problem by forcing a lower number of threads and a dedicated temp directory. AFAIU, there is no way to control the number of threads that fasterq-dump uses by default other than via the command-line option.

Ask/wish: Could you please consider adding support for overriding the default six (6) threads via an environment variable, e.g. FASTERQ_DUMP_THREADS?

Question: Another workaround we're considering is defining a wrapper Bash function that injects safer defaults (see below); would options given later on the command line override ones given earlier?

Thanks
Fasterq-dump has also necessitated reboots of our servers, and strangely of the home-directory server, despite my script writing the temporary files and outputs to directories on other servers (verified that they are indeed being written there) and running it on other nodes. I'm not the admin, so I don't know the specifics (they said that "ps -ef" hanging was the most common issue), but it's very odd behavior. I've been using fasterq-dump because the ability to use multiple threads (typically 12-18) makes it substantially faster at downloads than prefetch in my testing. However, the sysadmins are now disabling fasterq-dump.
@erhoppe, thanks for sharing. Do you know if you also have a BeeGFS file system (as others who reported this problem elsewhere on this issue tracker do)? If you don't know, try:

```sh
$ beegfs-ctl | head -1
BeeGFS Command-Line Control Tool (http://www.beegfs.com)
```

or see if you have some other BeeGFS CLI tools installed.
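If beegfs-ctl is not installed on the compute nodes, another way to check for BeeGFS (a generic suggestion, not from the original comment; the scratch path is a placeholder) is to inspect the mounted file systems directly:

```sh
# Any BeeGFS mounts show up with file-system type "beegfs".
grep -w beegfs /proc/mounts

# Or check the type of the specific directory you write to, e.g. your scratch space.
df -T /path/to/scratch
```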
@HenrikBengtsson, none of those commands were found, but I did find one reference to BeeGFS in the wiki from 2019, in reference to the file systems I am using.
Update: BeeGFS is indeed what they're using for the scratch storage. According to our sysadmins, BeeGFS has found the bug and is working on a fix! (Apparently, other programs have caused the issue as well, though I expect this issue can be closed since it's not a fasterq-dump problem.)
@erhoppe: Do you maybe have a link to the BeeGFS bug?
@timeu, I'm afraid not. I believe they were corresponding with them through email, and I also haven't checked in with my sysadmins in a while since I've been working on other things. If you're having the same issue, let me know and I can reach out to them to see if they've heard anything.
@erhoppe: Thanks for the quick response. Yes, we are seeing the exact same issue.
The file system is assumed to be POSIX compliant for all Unix and Unix-like hosts. There isn't any general means for user programs to detect that a file system has bugs that it can't work with, particularly if the nature of the bug is to hang the process or the host.
Hi (again).

Ad-hoc workaround attempt:

Our current approach for an attempt to avoid this problem is to declare a Bash function:

```sh
fasterq-dump() {
    command fasterq-dump --threads 2 --temp "$(mktemp -d)" "$@"
}
```

The hope is that by running with a minimal number of threads (= two) and towards a dedicated, unique temporary directory, the risk of taking down the compute host is reduced. The above Bash function obviously only works in Bash; sysadms need to do similar things for other shells they support on their systems.

Caveat:

It is not clear to me what settings will take effect if a user calls, say,

```sh
fasterq-dump --threads 4 --temp /path/to/something
```

which then ends up calling

```sh
fasterq-dump --threads 2 --temp "$(mktemp -d)" --threads 4 --temp /path/to/something
```

Will the later options override the earlier ones?

Wish:

Instead of the above hack, I wish it would be possible to control the defaults via environment variables. For example, setting

```sh
export FASTERQ_DUMP_THREADS=2
export FASTERQ_DUMP_TEMP=$(mktemp -d)
```

globally would achieve the same as the above Bash function, and it would work regardless of shell. A related wish is that ...

Motivation:

We know compute hosts that run BeeGFS go down all the time around the world because of users using fasterq-dump. Even if it is a BeeGFS bug (has that actually been confirmed and reported?) that will eventually be fixed in BeeGFS, upgrading parallel file systems is a very, very slow process. Because of this, it will take many years before all compute environments have upgraded to a fixed BeeGFS. In other words, we're going to live with this problem for a long time, even if it technically is not a problem of fasterq-dump itself.
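Until such environment variables exist, a shell-agnostic approximation is a small wrapper script placed ahead of the real binary on PATH. This is only a sketch: the install paths are made up and the FASTERQ_DUMP_* variables are the hypothetical ones wished for above.

```sh
#!/bin/sh
# Hypothetical site-wide wrapper, e.g. installed as /usr/local/bin/fasterq-dump.
# REAL must point at the actual fasterq-dump binary at your site.
REAL=/opt/sratoolkit/bin/fasterq-dump

# Conservative defaults unless the (hypothetical) environment variables are set;
# user-supplied options still follow on the command line.
exec "$REAL" \
    --threads "${FASTERQ_DUMP_THREADS:-2}" \
    --temp "${FASTERQ_DUMP_TEMP:-$(mktemp -d)}" \
    "$@"
```

It shares the caveat described above: if the user also passes --threads or --temp, both options end up on the command line, and it is unclear which one wins.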
With the latest version of the toolkit (2.11.2), option -t|--temp sets the directory of all temporary files, including the VDB cache. There are means to specify configuration for the SRA Toolkit, both per user and for an installation. The documented means for users to edit their configuration is to use vdb-config.
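For reference, the per-user route looks roughly like this (the config file path is the usual default and may differ on some setups):

```sh
# Interactive configuration editor shipped with the SRA Toolkit.
vdb-config --interactive

# The resulting per-user settings are stored as plain text, typically here:
cat ~/.ncbi/user-settings.mkfg
```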
That's brilliant. I didn't know that this was not already the case. So, if I understand this correctly, "fasterq-dump: option -t sets directory of all temp files (including VDB cache)" means that in sratoolkit (< 2.11.2) the current working directory could still be hit by lots of temporary file I/O even if you directed --temp somewhere else? If so, it could be (= I hope) that with sratoolkit (>= 2.11.2) there's no longer a need to limit the number of threads as a workaround. We'll try to rerun the tests we ran in the past where we could reproducibly crash compute hosts on BeeGFS. Hopefully, it'll work with the default number of threads.
Unfortunately, from a sysadm point of view, asking users to edit their own config will only reach so many users.
Now, this is interesting. Before I go down that path, will this be possible to do also when users install their own version of sratoolkit, e.g. via Anaconda (which is unfortunately quite common despite us already providing it via environment modules)? A follow-up question for my understanding as a fellow developer: what's the reason for not defaulting the temp location to the system-provided temporary directory?
I heard back from my sysadmin: they and BeeGFS were unable to reproduce the error, so it seems unlikely that we'll get the fix from that end. From what they've seen, the issue happens when writing to any BeeGFS file system; jobs have only completed successfully when writing to local storage or when using fastq-dump. They've limited fasterq-dump to a single thread, so they haven't had more data on what's causing it.
Before release 2.11.2 there are 2 different temp locations involved in fasterq-dump: the scratch files written by the tool itself (controlled by -t|--temp, defaulting to the current working directory) and the cache files written by the underlying library (defaulting to '/var/tmp'). Now with release 2.11.2, fasterq-dump forces the underlying library to use the location given via -t|--temp as well. If the temp location is not given, the previous 2 different default locations are still in use (current working directory and '/var/tmp'). In hindsight it might have been better to default the -t|--temp location to the system-provided location, but we did not change it later because we did not want to surprise current users.
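So on 2.11.2 or later, a single explicit --temp should keep both kinds of temporary I/O off the parallel file system; a sketch (the accession and the local scratch path are placeholders):

```sh
# Create a unique scratch directory on node-local disk (or /dev/shm) and
# route all of fasterq-dump's temporary files there.
SCRATCH=$(mktemp -d -p /local/scratch)
fasterq-dump SRR000001 --temp "$SCRATCH" --outdir fastq
rm -rf "$SCRATCH"
```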
Thanks for clarifying - it's helpful information for sysadms who are trying to mitigate this problem - folks who probably are not familiar with this tool at all.
I might come back and nudge you about this one later ;)
I face the same problem with both version 2.11.0 and the latest version 3.0.0.
I am also still facing the described issues on a BeeGFS file system using both 2.11.0 and 3.0.0. While the system does not always enter the described "ps -ef" hang, nodes still become unresponsive. Using a node-local directory for --temp seems to avoid the problem. However, my cluster environment is using BeeGFS as its main storage, so there are no large local disk capacities to use; moreover, users are still able to kill nodes if they specify wrong directories. Is there any update on this? Thank you!
@tjakobi, just in case someone (e.g. someone from sra-tools or BeeGFS) finds this later on and tries to troubleshoot it, what version of BeeGFS are you running? (I'm just trying to maximize the chances for this critical problem to be fixed.)
Hi @HenrikBengtsson, of course, good point. I'm running BeeGFS 7.3.1 on Ubuntu 20.04.5 LTS (with the 5.4.0-131-generic kernel).
Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:
Dear NCBI developer.
I wanted to use the new fasterq-dump tool to download 1000 FASTQ files. As recommended, I used prefetch to get the .sra files. Then I tried to extract the FASTQ files with fasterq-dump. For this, I submitted multiple jobs on a Slurm cluster to speed up the process.
However, many of the jobs hung, which is why many cluster nodes had to be rebooted. Do you have any idea why this could happen?
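As an illustration only (the actual script from the report is not shown here), a job of the kind described above - an array of prefetch + fasterq-dump tasks on Slurm - might look like this; the accession list, paths, and resource requests are made up:

```sh
#!/bin/bash
#SBATCH --job-name=sra2fastq
#SBATCH --cpus-per-task=6
#SBATCH --mem=8G

# Pick one accession per array task from a plain-text list (one ID per line).
ACC=$(sed -n "${SLURM_ARRAY_TASK_ID}p" accessions.txt)

prefetch "$ACC" --output-directory sra/
fasterq-dump "sra/$ACC" --threads "$SLURM_CPUS_PER_TASK" --outdir fastq/
```

Such a script would be submitted with something like `sbatch --array=1-1000 <script>`.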
Here is my script:
Here is a log
Slurm log:
The system administrator explained the problem to me: