
fasterq-dump getting stuck in loops -- inconsistent and rectified when running fastq-dump #161

Open
magruca opened this issue Oct 24, 2018 · 18 comments


@magruca

magruca commented Oct 24, 2018

Hi all--

I am working on a cluster with a SLURM job handler and plan to download a few thousand SRAs for a project. I have developed a Nextflow pipeline to handle this load and was excited when you released the new version of sra-tools that facilitates multi-threading. However, I am sporadically crashing nodes when trying to use fasterq-dump. I noticed there were existing issues about fasterq-dump getting caught in loops, but I believe this issue differs from the others I have read.

For example, I was trying to run fasterq-dump on Andrysik2017 data SRR4090[098-109] with basic commands: fasterq-dump ${SAMPLE.sra} -e 8. Each time I ran fasterq-dump on these examples, different SRRs would get stuck in an endless loop and ultimately crash an entire node. This hasn't been consistent: sometimes entire projects like this go through without a hitch, but I would say failures have been more common than not over the past two weeks I have been running it. Failures are also significantly more common when I try to run multiple jobs in parallel, but again, if I cancel all of those jobs and resubmit the same batch, the SRRs that proceed and those that get stuck differ from the first submission.

I originally suspected this might be a problem with our cluster or with Nextflow; however, when I switched back to fastq-dump the issue went away, which leads me to believe it's not on our end.

Thank you in advance for your time in troubleshooting and resolving this issue.

@wraetz
Contributor

wraetz commented Oct 25, 2018

By default, fasterq-dump uses the current directory for its temporary files. If you are running other jobs in parallel in the same directory, they compete for space. You could try using a different directory for each job (command-line option -t|--temp).

Speed can also be improved if the temporary directory is on a different file system, such as an SSD or a RAM disk.

I do not know the details of your cluster management, but I know from other users that such systems terminate jobs that exceed their limits. Fasterq-dump needs more temporary space than fastq-dump, both on disk and in memory; that is where much of the speed improvement comes from, not only from the use of multiple threads. You may have to adjust your job settings.
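
For example (the directories here are just placeholders for fast, job-specific scratch space):

    fasterq-dump SRR4090098 -e 8 --temp /ssd/scratch/job1 -O /ssd/scratch/job1/out
    fasterq-dump SRR4090099 -e 8 --temp /ssd/scratch/job2 -O /ssd/scratch/job2/out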

@magruca
Author

magruca commented Oct 25, 2018

Thank you for the quick reply.

Nextflow runs each job in its own separate temp directory, so I don't imagine that's the issue. We also have over a petabyte of space, so I don't think space is the issue either.

Our cluster has 64 nodes with 128 cores each. I have been running these jobs requesting 8 cores with a memory limit of up to 200GB (I had suspected it might be a memory issue). I could increase this further, but given that I can run fastq-dump with 15-20GB without issue, I would probably just revert to that instead. I have run mpstat on the jobs, and they never seem to get close to exceeding the memory limit. This also isn't an issue of speed: when it runs, it is extremely efficient. The issue arises when I try to run multiple fasterq-dump jobs at once; only about 1/3 to 1/2 of them actually run, and the others get stuck (some even start and then stall) and just loop until they hit the time wall I set for them.

To add to this -- I never get an exit code indicating lack of space or memory allocation. The jobs never actually terminate.
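
For reference, the per-job requests look roughly like this (a sketch of the SLURM directives only; the actual submission is generated by Nextflow, and the sample path is a placeholder):

    #SBATCH --cpus-per-task=8
    #SBATCH --mem=200G

    fasterq-dump "$SRA_FILE" -e 8    # $SRA_FILE stands in for the prefetched .sra path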

@wraetz
Contributor

wraetz commented Oct 25, 2018

I am very interested in finding out the reason for getting stuck. So you are saying that you can process SRR accessions with fasterq-dump without problems as long as you run just one at a time, but when you run multiple of them at the same time, some of them get stuck? Can you please tell me what "getting stuck" means? Are you running with the progress option (-p|--progress) and the progress stops? As I read your comment, you are running multiples of them on different machines, but they somehow influence each other? How many of them are you running at once?
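
That is, something like the following (accession taken from your examples), so that you can see at which point it stalls:

    fasterq-dump SRR4090098 -e 8 -p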

@magruca
Author

magruca commented Oct 25, 2018

Correct -- and it's never the same SRR that gets "stuck".

By stuck I mean one of two things:

  1. The job appears to start (i.e., a temp directory is created in my head directory), but then processing never seems to begin. I have a time cap of 4 hours on the jobs, which is generous, since even very large fastq files (100GB+) will typically finish in <5 minutes using 8 cores.
  2. The job starts (i.e., I see multiple temporary .fastq files in the temp directory, reflecting the multi-threading), but then gets "stuck" and again runs until it hits the 4-hour wall.

I do not get any error codes in either of these scenarios. Sometimes these jobs are split across different nodes and sometimes they are on the same node, but there does not seem to be a pattern as to which get stuck (e.g. if there are two running on the same node, sometimes those will both run, sometimes only one, sometimes neither). The node allocation just depends on job load on our cluster.

I'm typically trying to run 8 or fewer jobs at once. I've also tried requesting fewer threads (between 4 and 8), and this does not seem to resolve the issue. Switching to fastq-dump has resolved it, so it seems to be something with the multi-threading, although admittedly I am assuming that is the only difference between the two tools. I'm just having trouble getting at what could be causing it. What's odd is that even if it were a memory allocation issue, once the jobs that did start have completed successfully, the "stuck" jobs do not pick up and begin running.

Thanks!

@wraetz
Contributor

wraetz commented Oct 25, 2018

Did you prefetch the accessions or are you 'downloading' them with fasterq-dump/fastq-dump?

@wraetz
Contributor

wraetz commented Oct 25, 2018

I have a suspicion that our servers are limiting the number of connections from a given IP address.
If you use fastq-dump, you are asking for just 1 connection per process.
If you are using fasterq-dump with 8 threads, there are 8 connections in parallel per process.
If you are running 8 fasterq-dump processes at the same time, that makes for 64 connections...
However, if you are using prefetch, that would not be the case.
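
In other words, prefetching first and then converting the local copy would look roughly like this (paths are placeholders):

    prefetch SRR4090098
    fasterq-dump SRR4090098 -e 8 --temp /local/scratch/tmp -O fastq/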

@magruca
Author

magruca commented Oct 25, 2018

I have been using prefetch to get the SRAs first. I don't want to store that many fastq files on our server (for obvious reasons), but it's nice to have the .sra files in case we want to re-process anything. Presumably, though, there is still some information stored on the server that fasterq-dump needs? I'm actually not sure whether all the information needed for the fastq is stored directly in the .sra file.

@kwrodarmer
Contributor

If you're using prefetch, no information will be left on NCBI servers.

@wraetz
Contributor

wraetz commented Oct 25, 2018

Prefetch puts the SRAs into a special location where the tools can find them: /home/username/ncbi/public/sra
How do you make sure the tool is using this location when running on the cluster?

@wraetz
Contributor

wraetz commented Oct 25, 2018

If the SRA is aligned against a reference, the tool also needs access to the reference accessions. Prefetch downloads them into /home/username/ncbi/public/refseq. Does your cluster node have access to that as well?

@magruca
Author

magruca commented Oct 25, 2018

I redirected the output of the SRAs to a /scratch directory. I did this by creating a .ncbi/ directory in my home directory and making a user-settings.mkfg file with the following line:

/repository/user/main/public/root = "/scratch/<my defined sra output directory>/"

I'm then running fastq-dump/fasterq-dump on SRAs in this repository.
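
So the end-to-end flow is roughly the following (the /scratch path and accession are just illustrative):

    # ~/.ncbi/user-settings.mkfg
    /repository/user/main/public/root = "/scratch/sra-repo/"

    # per accession
    prefetch SRR4090098
    fasterq-dump /scratch/sra-repo/sra/SRR4090098.sra -e 8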

@kwrodarmer
Contributor

@magruca - are you still experiencing issues?

@magruca
Author

magruca commented Oct 31, 2018

Yes -- I ultimately chose to switch back to using the parallel-fastq-dump wrapper in the interim as I'm not experiencing any issues with that.
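
For reference, the interim calls look roughly like this (output directory is a placeholder):

    parallel-fastq-dump --sra-id SRR4090098 --threads 8 --outdir fastq/ --split-files --gzip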

@klymenko
Contributor

@magruca, do you still need help?
We released 2.11.0 and made lots of improvements since your report.

@jorgeboucas

jorgeboucas commented Mar 26, 2021

We are seeing this in 2.11.0.

When it hangs, it is not even possible to interrupt it with Ctrl+C.

We are investigating this on our side and will let you know what we come up with.

This is running on bare metal in a Singularity container with a mounted BeeGFS file system. So far we do not see problems when changing the cwd to the host's /tmp.

@durbrow
Collaborator

durbrow commented Mar 26, 2021

We have seen issues before with fasterq-dump and BeeGFS. fasterq-dump does I/O from multiple threads, and I suspect (I have no access to a BeeGFS installation to verify) that the file system driver doesn't like this and deadlocks. (A process can't respond to a signal while it is in kernel space; that's why Ctrl+C doesn't work.)

As an aside, fasterq-dump uses a lot of temporary files, and it is probably best to create these on a locally attached device. You can set the location with -t|--temp, but the default is the cwd; that is probably why it works when you change the cwd to /tmp.
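
For example, something like this keeps the scratch I/O off BeeGFS entirely (the local temp path and output directory are of course site-specific placeholders):

    fasterq-dump SRR4090098 -e 8 --temp /tmp/fasterq-tmp -O /beegfs/project/fastq/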

@HenrikBengtsson

Interesting; we're also on BeeGFS and experiencing this. We're trying to mitigate this problem in different ways. All the ideas we have right now involve writing wrappers for fasterq-dump that set "better" defaults to avoid these host crashes. In the same spirit as #463 (comment), it would be super neat if one could override the default value of the --temp CLI option via an environment variable, e.g. FASTERQ_DUMP_TEMP. That way we could set:

export FASTERQ_DUMP_TEMP=/fast/big/local/scratch/tempfolder

globally to lower the risk of running into these problems. This would probably also benefit end users, since they would be writing to a much faster local disk rather than to a shared global parallel file system.
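
Until something like that exists, the wrapper we have in mind is along these lines (a sketch only; the path to the real binary and the FASTERQ_DUMP_TEMP variable are our own site conventions, not part of sra-tools):

    #!/usr/bin/env bash
    # Sketch of a site wrapper: forwards to the real fasterq-dump, injecting
    # --temp from FASTERQ_DUMP_TEMP unless the caller already passed -t/--temp.
    set -e
    REAL=/opt/sra-tools/bin/fasterq-dump   # assumed path to the real binary

    for arg in "$@"; do
      case "$arg" in
        -t|--temp|--temp=*) exec "$REAL" "$@" ;;   # caller chose a temp dir; don't override
      esac
    done

    if [ -n "${FASTERQ_DUMP_TEMP:-}" ]; then
      exec "$REAL" "$@" --temp "$FASTERQ_DUMP_TEMP"
    else
      exec "$REAL" "$@"
    fi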

@HenrikBengtsson

Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:

  • Fixed a deadlock between mmap and read when both were operating on the same memory area. This should fix issues with multithreaded runs of fasterq-dump that have been reported by some users.
