
fasterq-dump getting stuck in loops -- inconsistent and rectified when running fastq-dump #161

Open
magruca opened this issue Oct 24, 2018 · 18 comments


@magruca

magruca commented Oct 24, 2018

Hi all--

I am working on a cluster with a SLURM job handler and plan to download a few thousand SRAs for a project. I have developed a Nextflow pipeline to handle this load and was excited when you released the new version of sra-tools that facilitates multi-threading. However, I am sporadically crashing nodes when trying to use fasterq-dump. I noticed there were existing issues about fasterq-dump getting caught in loops, but I believe this issue differs from the others I have read.

For example, I was trying to run fasterq-dump on Andrysik2017 data SRR4090[098-109] with basic commands: fasterq-dump ${SAMPLE.sra} -e 8. Each time I ran fasterq-dump on these examples, different SRRs would get stuck in an endless loop and ultimately crash an entire node. This hasn't been consistent: sometimes entire projects like this go through without a hitch, but I would say failures have been more common than not over the past two weeks I have been running it. Failures are also significantly more common when I try to run multiple jobs in parallel, but again, if I cancel all of those jobs and resubmit the same batch, the SRRs that proceed and those that get stuck differ from the first submission.

I originally suspected this might be a problem with our cluster or with Nextflow; however, when I switched back to fastq-dump the issue went away, which leads me to believe it's not on our end.

Thank you in advance for your time in troubleshooting and resolving this issue.

@wraetz
Contributor

wraetz commented Oct 25, 2018

By default, fasterq-dump uses the current directory for its temporary files. If you are running other jobs in parallel in the same directory, they compete for space. You could try using a different directory for each job (command-line option -t|--temp).

Speed can also be improved if the temporary directory is on a different file system, such as an SSD or a RAM disk.

I do not know the details of your cluster management, but I know from other users that such systems terminate jobs that exceed their limits. Fasterq-dump needs more temporary space than fastq-dump, both on disk and in memory; that is where much of the speed improvement comes from, not only from the use of multiple threads. You may have to adjust your job settings.
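
For example (the directories here are just placeholders for fast, job-specific scratch space):

    fasterq-dump SRR4090098 -e 8 --temp /ssd/scratch/job1 -O /ssd/scratch/job1/out
    fasterq-dump SRR4090099 -e 8 --temp /ssd/scratch/job2 -O /ssd/scratch/job2/out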

@magruca
Author

magruca commented Oct 25, 2018

Thank you for the quick reply.

Nextflow runs each job in its own separate temp directory, so I don't imagine that's the issue. We also have over a petabyte of space, so I don't think space is the issue either.

Our cluster has 64 nodes with 128 cores each. I have been running these jobs requesting 8 cores with a memory limit of up to 200GB (I had suspected it might be a memory issue). I could increase this further, but given that I can run fastq-dump with 15-20GB without issue, I would probably just revert to that instead. I have run mpstat on the jobs, and they never seem to get close to exceeding the memory limit. This also isn't an issue of speed: when it runs, it is extremely efficient. The issue arises when I try to run multiple fasterq-dump jobs at once; only about 1/3 to 1/2 of them actually run, and the others get stuck (some even start and then stall) and just loop until they hit the time wall I set for them.

To add to this -- I never get an exit code indicating lack of space or memory allocation. The jobs never actually terminate.
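
For reference, the per-job requests look roughly like this (a sketch of the SLURM directives only; the actual submission is generated by Nextflow, and the sample path is a placeholder):

    #SBATCH --cpus-per-task=8
    #SBATCH --mem=200G

    fasterq-dump "$SRA_FILE" -e 8    # $SRA_FILE stands in for the prefetched .sra path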

@wraetz
Contributor

wraetz commented Oct 25, 2018

I am very interested in finding out the reason for getting stuck. So you are saying that you can process SRR accessions with fasterq-dump without problems as long as you run just one at a time, but when you run multiple of them at the same time, some of them get stuck? Can you please tell me what "getting stuck" means? Are you running with the progress option (-p|--progress) and the progress stops? As I read your comment, you are running multiples of them on different machines, but they somehow influence each other? How many of them are you running at once?
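
That is, something like the following (accession taken from your examples), so that you can see at which point it stalls:

    fasterq-dump SRR4090098 -e 8 -p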

@magruca
Author

magruca commented Oct 25, 2018

Correct -- and it's never the same SRR that gets "stuck".

By stuck I mean one of two things:

  1. The job appears to start (i.e., a temp directory is created in my head directory), but then processing never seems to begin. I have a time cap of 4 hours on the jobs, which is generous, since even very large fastq files (100GB+) will typically finish in <5 minutes using 8 cores.
  2. The job starts (i.e., I see multiple temporary .fastq files in the temp directory, reflecting the multi-threading), but then gets "stuck" and again runs until it hits the 4-hour wall.

I do not get any error codes in either of these scenarios. Sometimes these jobs are split across different nodes and sometimes they are on the same node, but there does not seem to be a pattern as to which get stuck (e.g. if there are two running on the same node, sometimes those will both run, sometimes only one, sometimes neither). The node allocation just depends on job load on our cluster.

I'm typically trying to run 8 or fewer jobs at once. I've also tried requesting fewer threads (between 4 and 8), and this does not seem to resolve the issue. Switching to fastq-dump has resolved it, so it seems to be something with the multi-threading, although admittedly I am assuming that is the only difference between the two tools. I'm just having trouble getting at what could be causing it. What's odd is that even if it were a memory allocation issue, once the jobs that did start have completed successfully, the "stuck" jobs do not pick up and begin running.

Thanks!

@wraetz
Contributor

wraetz commented Oct 25, 2018

Did you prefetch the accessions or are you 'downloading' them with fasterq-dump/fastq-dump?

@wraetz
Contributor

wraetz commented Oct 25, 2018

I have a suspicion that our servers are limiting the number of connections from a given IP address.
If you use fastq-dump, you are asking for just 1 connection per process.
If you are using fasterq-dump with 8 threads, there are 8 connections in parallel per process.
If you are running 8 fasterq-dump processes at the same time, that makes for 64 connections...
However, if you are using prefetch, that would not be the case.
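
In other words, prefetching first and then converting the local copy would look roughly like this (paths are placeholders):

    prefetch SRR4090098
    fasterq-dump SRR4090098 -e 8 --temp /local/scratch/tmp -O fastq/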

@magruca
Author

magruca commented Oct 25, 2018

I have been using prefetch to get the SRAs first. I don't want to store that many fastq files on our server (for obvious reasons), but it's nice to have the .sra files in case we want to re-process anything. Presumably, though, there is still some information stored on the server that fasterq-dump needs? I'm actually not sure whether all the information needed for the fastq is stored directly in the .sra file.

@kwrodarmer
Contributor

If you're using prefetch, no information will be left on NCBI servers.

@wraetz
Contributor

wraetz commented Oct 25, 2018

Prefetch puts the SRAs into a special location where the tools can find them: /home/username/ncbi/public/sra
How do you make sure the tool is using this location when running on the cluster?

@wraetz
Contributor

wraetz commented Oct 25, 2018

If the SRA is aligned against a reference, the tool also needs access to the reference accessions. Prefetch downloads them into /home/username/ncbi/public/refseq. Does your cluster node have access to that as well?

@magruca
Author

magruca commented Oct 25, 2018

I redirected the output of the SRAs to a /scratch directory. I did this by creating a .ncbi/ directory in my home directory and making a user-settings.mkfg file with the following line:

/repository/user/main/public/root = "/scratch/<my defined sra output directory>/"

I'm then running fastq-dump/fasterq-dump on SRAs in this repository.
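
So the end-to-end flow is roughly the following (the /scratch path and accession are just illustrative):

    # ~/.ncbi/user-settings.mkfg
    /repository/user/main/public/root = "/scratch/sra-repo/"

    # per accession
    prefetch SRR4090098
    fasterq-dump /scratch/sra-repo/sra/SRR4090098.sra -e 8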

@kwrodarmer
Contributor

@magruca - are you still experiencing issues?

@magruca
Author

magruca commented Oct 31, 2018

Yes -- I ultimately chose to switch back to using the parallel-fastq-dump wrapper in the interim as I'm not experiencing any issues with that.
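
For reference, the interim calls look roughly like this (output directory is a placeholder):

    parallel-fastq-dump --sra-id SRR4090098 --threads 8 --outdir fastq/ --split-files --gzip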

@klymenko
Contributor

@magruca, do you still need help?
We released 2.11.0 and made lots of improvements since your report.

@jorgeboucas

jorgeboucas commented Mar 26, 2021

We are seeing this in 2.11.0.

When it hangs, it is not even possible to interrupt it with Ctrl+C.

We are investigating this on our side and will let you know what we come up with.

This is running on bare metal in a Singularity container with a mounted BeeGFS file system. So far we do not see problems when changing the cwd to the host's /tmp.

@durbrow
Collaborator

durbrow commented Mar 26, 2021

We have seen issues before with fasterq-dump and BeeGFS. fasterq-dump does I/O from multiple threads, and I suspect (I have no access to a BeeGFS installation to verify) that the file system driver doesn't like this and deadlocks. (A process can't respond to a signal while it is in kernel space; that's why Ctrl+C doesn't work.)

As an aside, fasterq-dump uses a lot of temporary files, and it is probably best to create these on a locally attached device. You can set the location with -t|--temp, but the default is the cwd; that is probably why it works when you change the cwd to /tmp.
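
For example, something like this keeps the scratch I/O off BeeGFS entirely (the local temp path and output directory are of course site-specific placeholders):

    fasterq-dump SRR4090098 -e 8 --temp /tmp/fasterq-tmp -O /beegfs/project/fastq/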

@HenrikBengtsson

Interesting; we're also on BeeGFS and experiencing this. We're trying to mitigate this problem in different ways. All the ideas we have right now involve writing wrappers for fasterq-dump that set "better" defaults to avoid these host crashes. In the same spirit as #463 (comment), it would be super neat if one could override the default value of the --temp CLI option via an environment variable, e.g. FASTERQ_DUMP_TEMP. That way we could set:

export FASTERQ_DUMP_TEMP=/fast/big/local/scratch/tempfolder

globally to lower the risk of running into these problems. This would probably also benefit end users, since they would be writing to a much faster local disk rather than to a shared global parallel file system.
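
Until something like that exists, the wrapper we have in mind is along these lines (a sketch only; the path to the real binary and the FASTERQ_DUMP_TEMP variable are our own site conventions, not part of sra-tools):

    #!/usr/bin/env bash
    # Sketch of a site wrapper: forwards to the real fasterq-dump, injecting
    # --temp from FASTERQ_DUMP_TEMP unless the caller already passed -t/--temp.
    set -e
    REAL=/opt/sra-tools/bin/fasterq-dump   # assumed path to the real binary

    for arg in "$@"; do
      case "$arg" in
        -t|--temp|--temp=*) exec "$REAL" "$@" ;;   # caller chose a temp dir; don't override
      esac
    done

    if [ -n "${FASTERQ_DUMP_TEMP:-}" ]; then
      exec "$REAL" "$@" --temp "$FASTERQ_DUMP_TEMP"
    else
      exec "$REAL" "$@"
    fi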

@HenrikBengtsson

Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:

  • Fixed a deadlock between mmap and read when both were operating on the same memory area. This should fix issues with multithreaded runs of fasterq-dump that have been reported by some users.
