
Pipeline execution hangs when native tasks fail to be submitted #2060

Closed · drpatelh opened this issue Apr 23, 2021 · 18 comments
Milestone: v21.04.0

@drpatelh (Contributor) commented Apr 23, 2021

I am testing the nf-core/rnaseq pipeline via Tower on AWS, and it appears that native Groovy processes that should only be executed once are caught up in some sort of recursion that keeps spawning more and more jobs. As you can see in the screenshot below, I killed the pipeline execution after 2,144 jobs had been spawned for that single process.

[screenshot: Tower run showing 2,144 jobs spawned for the single native process]

Running the command locally and via GitHub Actions works perfectly fine and only spawns a single process:

nextflow run nf-core/rnaseq -profile test -r dev

You should be able to reproduce this in Tower with an AWS set-up:

[screenshot: Tower AWS set-up]

The process is called in the main workflow here and the module file for the process is here.

I have another workflow in the pipeline that uses a native Groovy process and I observed the same issue.

If it's something quite low-level, I'm happy to try and find an alternative solution so we can get the pipeline out :)

Thanks a bunch!
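For context: a "native" process in Nextflow uses an exec: block of plain Groovy instead of a shell script:, and that code runs inside the Nextflow head job rather than being submitted to the executor. A minimal sketch of such a process (hypothetical names and logic, not the actual nf-core/rnaseq module):

// Hypothetical native process: the exec: block is Groovy code run by the
// head job itself, so a single input file should produce a single task.
process COUNT_GTF_GENES {
    input:
    path gtf

    output:
    val n_records

    exec:
    // Nextflow extends Path with readLines(); count non-comment GTF records
    n_records = gtf.readLines().count { it && !it.startsWith('#') }
}

workflow {
    COUNT_GTF_GENES( file(params.gtf) )  // value channel, so exactly one task
    COUNT_GTF_GENES.out.view()
}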

@pditommaso (Member)

mamma mia .. which task is the one failing 2.1k times?

@drpatelh (Contributor, Author)

The module file for the native Groovy process is here. It only ever takes a single GTF file, so I have no idea why so many jobs are spawned. It is called in the main workflow here.

As I mentioned above, this happens with a completely unrelated native Groovy process too, so there seems to be a pattern here.

Let me know if you want me to test anything else.

@pditommaso (Member) commented Apr 23, 2021

I mean, there are 2.1k failed tasks. What's the name of the task that's failing? ATTRIBUTE_IN_GTF says 0 of 1.

@drpatelh (Contributor, Author)

Yup, it says 0 of 1, but when you hover over that particular task in Tower it shows Pending: 1, Failed: 2144, as in the first image. So I would assume the failed tasks are the result of that particular process?

@drpatelh (Contributor, Author)

[screenshot: Tower task detail showing Pending: 1, Failed: 2144]

@drpatelh (Contributor, Author)

[screenshot]

@pditommaso (Member)

Not able to replicate. Do you have the Nextflow log file? You should be able to download it from the Tower UI.

@drpatelh (Contributor, Author)

Hmmm... it's not there. Maybe because I deleted the results directory? I will try to reproduce and share the execution with you via Tower.

[screenshot: Tower UI showing no log file available]

@drpatelh (Contributor, Author)

OK, I was able to reproduce it and have shared the run with you via Tower. What's weird is that NF raises an error but still carries on submitting more and more jobs 🤔

nextflow.log

Uploading local `bin` scripts folder to s3://nf-core-awsmegatests/rnaseq/dev/work/tmp/e9/13da243989d816fbca30b9b8e80979/bin
WARN: Process 'SRA_DOWNLOAD:SRA_TO_SAMPLESHEET' cannot be executed by 'awsbatch' executor -- Using 'local' executor instead
WARN: Local executor only supports default file system -- Check work directory: s3://nf-core-awsmegatests/rnaseq/dev/work
Monitor the execution with Nextflow Tower using this url https://tower.nf/watch/2BhBHVOqsYejv0
[44/7513ef] Submitted process > SRA_DOWNLOAD:SRA_IDS_TO_RUNINFO (SRR11140746)
[56/41fd5d] Submitted process > SRA_DOWNLOAD:SRA_IDS_TO_RUNINFO (SRR11140744)
[37/27990b] Submitted process > SRA_DOWNLOAD:SRA_RUNINFO_TO_FTP (1)
[26/e62c35] Submitted process > SRA_DOWNLOAD:SRA_RUNINFO_TO_FTP (2)
[c2/33ed2f] Submitted process > SRA_DOWNLOAD:SRA_FASTQ_FTP (SRX7777164_T1)
[d3/ec9f21] Submitted process > SRA_DOWNLOAD:SRA_FASTQ_FTP (SRX7777166_T1)
Error executing process > 'SRA_DOWNLOAD:SRA_TO_SAMPLESHEET (SRX7777164_T1)'
Caused by:
  Process requirement exceed available memory -- req: 6 GB; avail: 1 GB

@pditommaso (Member)

The error is raised because there's a (global?) requirement of 6 GB, but the task, being native, runs on the head node, which only has 1 GB. Not sure what's happening then ..
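For reference, a requirement like that usually comes from a memory directive or a blanket default in the configuration, and the local executor (which native tasks fall back to) validates it against the physical memory of the machine it runs on. A sketch of the kind of setting that would trigger the error above; this is a hypothetical reconstruction, not the pipeline's actual config:

// nextflow.config -- hypothetical blanket default; note it applies to
// native tasks too, even though those run on the 1 GB head node
process {
    memory = 6.GB
}

// With only 1 GB available locally, the native task then fails the check:
//   Process requirement exceed available memory -- req: 6 GB; avail: 1 GB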

@pditommaso (Member)

OK, I'm able to replicate it. Now I need to find a solution 😬

@pditommaso changed the title from "Native Groovy processes spawn endless jobs on AWS" to "Pipeline execution hangs when native tasks fail to be submitted" on Apr 25, 2021
@pditommaso (Member)

OK, pushed a patch. Thanks for reporting.

@pditommaso added this to the v21.04.0 milestone on Apr 25, 2021
@drpatelh (Contributor, Author)

Thank you!

@pditommaso (Member)

Note that the underlying problem was the excessive memory request.

@drpatelh (Contributor, Author)

Yup. What is the best way to customise this via Tower? Do we need to change the default head node we are using?

@pditommaso (Member)

It's better to limit the resource request for native (i.e. local) tasks; it makes no sense to request 6 GB for those.
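One way to do that, as a sketch (the withName selector is taken from the process name in the log above and may need adjusting for the real pipeline), is a per-process override in nextflow.config:

// nextflow.config -- hypothetical override: give the native task a request
// small enough to fit on the head node, without touching other processes
process {
    withName: 'SRA_TO_SAMPLESHEET' {
        memory = 512.MB
    }
}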

@pditommaso (Member)

PS: the amount of memory for the Tower head job can be set in the compute environment settings ("Head Job memory" under Advanced settings).
