Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Fusion file system to Slurm and LSF executors #3516

Merged
merged 10 commits into from Feb 6, 2023

Conversation

pditommaso
Copy link
Member

This PR adds support for Fusion file system to Slurm and LSF grid executors.

The PR implements the following changes

  • Mark the GridTaskHandler as FusionAwareTask
  • Use the NXF_CHDIR variable as the common pattern to specify the job work directory instead of relying on grid-specific directives
  • When fusion execution is required, submit the job execution by creating an inline job wrapper piped via stdin special file, instead of creating a temporary launcher file

This feature for production usage requires #3513 or #3514

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso added this to the 23.04.0 milestone Dec 26, 2022
@pditommaso pditommaso marked this pull request as draft December 26, 2022 15:00
@bentsherman bentsherman changed the title Add support for Fusion file system to Slum and LSF executors Add support for Fusion file system to Slurm and LSF executors Jan 5, 2023
@bentsherman
Copy link
Member

Looks good so far 👍

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso marked this pull request as ready for review February 4, 2023 21:01
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Copy link
Member Author

Think the Fusion support is ready to go. Main changes:

  • the task work dir is defined via NXF_CHDIR variable instead of relying on batch scheduler directive
  • Batch scheduler log is sent to /dev/null to prevent writing .command.log in the local fs
  • Improved submit error troubleshooting including the actual submit command in the error message
  • Add support for Grid Engine other than Slurm and LSF

Copy link
Collaborator

@jordeu jordeu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested and it's really nice, it's magic! 😄

The only thing that was not intuitive to me is the need to explicitly set fusion.exportAwsAccessKeys = true to make it work.

I'm thinking that with the upcoming support for more data stores this will became a bit of a mess to always explicitly configure and maybe we can have a default way of sharing credentials (using standard env vars) that works out of the box and only when you use a different approach you need to tune your configuration.

@pditommaso
Copy link
Member Author

The only thing that was not intuitive to me is the need to explicitly set fusion.exportAwsAccessKeys = true to make it work.

I agree that's not great but how to handle it better? In principle storage authentication should be delegated to the corresponding cloud infra.

When using Fusion on-prem I'm expecting the use of S3-compatible storage such as Minio or Ceph. Don't think in this scenario the use of cloud-hosted object storage is a real use case.

@jordeu
Copy link
Collaborator

jordeu commented Feb 5, 2023

MinIO and CEPH also use AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to manage the authentication. So on 99% of the use cases when using Grid platforms + S3 the user will need to pass AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_S3_ENDPOINT to the container environment.

@pditommaso pditommaso merged commit fbc95e6 into master Feb 6, 2023
@pditommaso pditommaso deleted the fusion-grid branch February 6, 2023 15:10
@pditommaso
Copy link
Member Author

Included in version 23.02.0-edge

@JosephLalli
Copy link

JosephLalli commented Feb 28, 2023

So great to see this, thank you all!

Question for you @pditommaso: Another approach to supporting Fusion in grid-based executors (#3205) supported a wider array of grid-based executors. I am at an institution with a condor-based HTC setup.

Does this PR allow for the use of Fusion in any executors beyond Slurm and LSF? If so, what limits it to those two executors, and what work would need to be done (by me) to allow the use of Fusion with condor? If it matters, I intend to use it with our on-premises Minio-based s3-compatible storage.

@pditommaso
Copy link
Member Author

Currently, it is not supported, but in principle, it should be easy to add it.

To enable this capability what nextflow does it to pipe the launch command to the summit command stdin instead of creating a wrapper script.

For example when using Slurm instead of running

sbatch .command.run

it does

cat .command.run | sbatch

Does Condor allow the same?

@bentsherman
Copy link
Member

According to my research it does support stdin.

Also some notes here that may be relevant to Condor support.

@pditommaso
Copy link
Member Author

I guess so, it's needed somebody that takes care of implementing and validating it

@bentsherman
Copy link
Member

@JosephLalli if I draft a Fusion / Condor PR, can you test it in your environment? We don't have a Condor environment so makes it hard to develop new features for it. As long as you can run Nextflow, you should be able to build and test it from a branch.

@JosephLalli
Copy link

JosephLalli commented Feb 28, 2023 via email

@bentsherman
Copy link
Member

Excellent, I will tag you when we have a PR for it. We'll just need you to run a small pipeline (e.g. rnaseq-nf) on your cluster with Fusion.

I also don't know if Fusion works with S3-compatible storage on HPC yet, or at least it hasn't been tested. So we might be waiting for that support to arrive.

@pditommaso
Copy link
Member Author

Yes, it does. This an example config

aws {
  client {
    endpoint = '<MINIO ENDPOINT>'
    s3PathStyleAccess = true
  }
  accessKey = '<ACCESS>'
  secretKey = '<SECRET>'
}

wave { enabled = true }

fusion { enabled = true; exportAwsAccessKeys = true }

abhi18av pushed a commit to abhi18av/nextflow that referenced this pull request Oct 28, 2023
…ow-io#3516)

This commit adds the support for Fusion file system to Slurm, LSF and 
Grid Engine batch schedulers  

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants