Work directories not clearing after jobs complete, causing maxed out vm storage before pipeline completes #3838

Closed
jstnchou opened this issue Apr 5, 2023 · 3 comments

jstnchou commented Apr 5, 2023

Bug report

Hello,

I’m trying to download some fastq files from SRA using SRA-Toolkit and place them in Azure storage. However, it appears that the working directories of jobs are not being cleared after the jobs finish.

Expected behavior and actual behavior

With SRA-Toolkit, .sra files are first prefetched, and those files are then used to download the fastq files for the same SRA ID. The script my pipeline runs includes a step that deletes the directory containing the .sra and other intermediate files, but since I’m moving the fastq files to Azure storage in my process, the script doesn’t remove those afterwards; I assumed the working directories on the nodes would be cleared once the jobs complete.

However, when attempting to download the fastqs for a manifest of 100 or more SRA IDs, the pipeline periodically shuts down with an error that the fastq file for a given ID cannot be found (to move to Azure storage). Azure also shows nodes as unusable when this happens. If I revise my manifest to exclude the SRA IDs that have already been pulled successfully and rerun the pipeline, it runs just fine. This leads me to believe that the fastqs are not being deleted; they pile up and consume storage until new fastqs can no longer be downloaded, and so they are not found when the process tries to move them to Azure. This occurs even when I’m using a Standard_D32_v3 vm type and have the cleanup setting set to true (see the config sketch below). Although rerunning the pipeline with an updated manifest eventually gets the job done, it’s inconvenient and inefficient, so I’m wondering whether I’m missing something or whether this is a bug. Thanks!
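
For context, the relevant parts of my nextflow.config look roughly like the sketch below; account names and keys are blanked out as described under the reproduction steps, and anything not mentioned above (such as pool sizing) is just an illustrative placeholder:

  // nextflow.config (rough sketch, secrets replaced with blank strings)
  cleanup = true                        // delete the work directory when the run completes

  workDir = 'az://work'                 // pipeline work directory on Azure Blob storage

  process {
      executor = 'azurebatch'
  }

  azure {
      storage {
          accountName = ''              // removed
          accountKey  = ''              // removed
      }
      batch {
          location     = ''             // removed
          accountName  = ''             // removed
          accountKey   = ''             // removed
          autoPoolMode = true
          pools {
              auto {
                  vmType  = 'Standard_D32_v3'   // changed to a D2 in the attached example
                  vmCount = 1                   // illustrative placeholder
              }
          }
      }
  }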

Steps to reproduce the problem

I’ve included a tar.gz of an example pipeline containing my workflows, processes, script, and nextflow.config. Any personal keys or information have been removed from the config and replaced with blank strings. Keep in mind that I’ve changed the vm type to a D2 in this example pipeline in order to make it fail faster. I’ve also included a sample manifest, though I’m currently having some trouble reproducing exactly the same error as before.

It's worth noting that this tarred pipeline does not pull SRA files that require an ngc key to access, whereas I originally ran into the issue when pulling IDs that do require a key. But as mentioned, restarting the pipeline every time it fails eventually succeeded, so I don't think the key is what's causing the nodes to reach max capacity. At the very least, my config and workflow should show whether I'm doing anything that could lead to the working directories filling up.

example_manifest.txt
example_sra_call_fail.tar.gz

Command to run pipeline:

nextflow run example_sra_call_fail -profile az --manifest example_manifest.txt --output_folder az://test
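
For orientation, the workflow in the tarball is shaped roughly like the simplified sketch below; the details here (channel handling, publish mode, and so on) are just a sketch, and the full code is in the attached tar.gz:

  // main.nf (simplified sketch; see the attached tar.gz for the real code)
  nextflow.enable.dsl = 2

  params.manifest      = null           // text file listing one SRA accession per line
  params.output_folder = null           // e.g. az://test

  process sra_pull_process {
      publishDir params.output_folder, mode: 'move'   // move the fastq.gz files to Azure storage

      input:
      val sra_id

      output:
      path '*.gz'                       // the glob reported as missing in the error below

      script:
      template 'sra_pull.sh'            // the script shown in the program output below
  }

  workflow sra_pull {
      take:
      ids

      main:
      sra_pull_process(ids)
  }

  workflow {
      ids = Channel.fromPath(params.manifest).splitText().map { it.trim() }
      sra_pull(ids)
  }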

Program output

The following is the error I get when my original pipeline breaks after reaching max capacity and can no longer download new fastq files to send to Azure storage (I wasn't able to format it as a proper code block):

Error executing process > 'sra_pull:sra_pull_process (53)'

Caused by:
  Missing output file(s) `*.gz` expected by process `sra_pull:sra_pull_process (53)`

Command executed [/home/ljl/sra_call/templates/sra_pull.sh]:

  #!/usr/bin/env bash

  echo "Pulling sra file with following SRA accession from dbGaP database: " SRR1312784


  prefetch --ngc sra_key.ngc SRR1312784


  echo "Finished downloading sra file with following accession: " SRR1312784


  echo "Output file(s):"

  ls SRR1312784


  fastq-dump --split-files SRR1312784


  gzip -f *.fastq

  rm -rf SRR1312784

Command exit status:
  0

Command output:
  Pulling sra file with following SRA accession from dbGaP database:  SRR1312784

  Finished downloading sra file with following accession:  SRR1312784
  Output file(s):

Command error:
  2023-04-05T00:03:15 prefetch.3.0.3: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
  ls: SRR1312784: No such file or directory
  Failed to call external services.
  gzip: *.fastq: No such file or directory

Work dir:
  az://work/9b/0cb2804f7287e49aa04199b1bc0683

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

Environment

  • Nextflow version: 22.04.5 build 5708
  • Java version: openjdk version "11.0.18"
  • Operating system: WSL Linux
  • Bash version: 5.0.17(1)-release
jstnchou changed the title from "Work directories not being cleared after jobs complete, resulting in maxing out on vm storage before pipeline completes" to "Work directories not clearing after jobs complete, causing maxed out vm storage before pipeline completes" on Apr 5, 2023
bentsherman (Member) commented Apr 6, 2023

Automatic cleanup is a popular topic around here. Welcome to the club.

The cleanup option currently doesn't work correctly with cloud storage because of a bug, but there is a fix in review: #3836.

There is also a discussion to make the cleanup option delete these files during the pipeline execution rather than at the end: #452. Also a PR in progress: #3849.

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Sep 17, 2023
bentsherman (Member) commented Sep 17, 2023
Closing as duplicate of #452

bentsherman closed this as not planned (duplicate) on Sep 17, 2023
stale bot removed the stale label on Sep 17, 2023