
Ideas around automating container module installation by Nextflow #2

marcodelapierre opened this issue Feb 11, 2022 · 8 comments


@marcodelapierre
Owner

marcodelapierre commented Feb 11, 2022

Expanding on @vsoch's comment in #1:

> And one question I'm thinking - should the workflow be able to install and load the modules? Akin to how the others pull the containers? or is that too much?

First of all, thanks for starting this conversation, sounds very interesting to me!

Let me try and add further aspects.

  1. It would be super cool if Nextflow were able to install container modules by itself!

  2. Current behaviour: if a requested module is not found, exit with error.

  3. Desirable behaviour, if the requested module is not found:
    a. look for SHPC availability in the shell environment
    b. look for recipe availability of the needed module (shpc show)
    c. shpc install the container
    d. attempt the module load again
    e. if any of these fail, exit with appropriate error message.

[UPDATE]: ignore the following bit

I can see one main failure point that needs attention, which is...kind of my fault LOL.

From direct experience, when I expose container modules via Lmod/EnvMod, I kind of use a "namespace" trick:
most of our deployed container modules are `quay.io/biocontainers/<tool>/<version>`.
As I don't want users to have to always type the `quay.io/biocontainers` prefix, I go with `module use $module_dir/quay.io/biocontainers`, so that users only do `module load <tool>/<version>`.
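For illustration, the trick looks something like this (the path and the tool name are made-up examples):

# modulefiles deployed under <module_dir>/quay.io/biocontainers/<tool>/<version>
module use $module_dir/quay.io/biocontainers
module load samtools/1.15      # instead of quay.io/biocontainers/samtools/1.15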

Now, whenever this is done, you lose the correspondence between the module name, and the entry in the SHPC registry...can we robustly provide this information to SHPC?

Maybe ... all we need is a bit of extra intelligence in `shpc show`, to enable it to try and look for recipes with the information passed by Nextflow? This should include the version tag, as well as the possibility of a missing prefix...?
And maybe have a special `shpc show` flag to enable this behaviour?

If we can think about this, then we may be able to knock at Nextflow's door and discuss support for SHPC in their codebase! (see above)
@marcodelapierre
Owner Author

marcodelapierre commented Feb 11, 2022

Another thing I want to do is to ping them to show them the demo pipeline!

... but not now, my brain is completely boiled!

Be back on Monday

@marcodelapierre
Owner Author

marcodelapierre commented Feb 11, 2022

Adding some more reflections.

First: forget about my "digression" above on the "namespace" trick, "shpc show" and so on, I think I was going off track.

So...I really think yours is a great idea, worth developing and bringing to the attention of Nextflow.

Some context for my thinking. I always mention **improving user friendliness/reducing usage barrier** as my favourite advantage of SHPC, but there's actually at least another equally cool one, which is **structuring and tidying up containers deployments**.
What do I mean by this? From direct experience, when using Singularity you often end up with unwanted duplicates of the same SIF image, spread across the filesystem: the same user might download the same images during different workflows, and on a shared system distinct users may be using the very same container without knowing.
Now, SHPC successfully tackles this by enabling, for instance:

  1. central managed installations of containers, that can be used by multiple users
  2. even for a single user, a convenient way to store and access containers from a single location, without duplicates

Why am I telling all this?
Well, Nextflow already takes good care of usage barrier/user experience for containers, by abstracting away quite a few aspects of their management (similar to my first point on SHPC above).
However, it does not completely solve the second aspect, of having clean, rationalised container installations.

That can be a good advantage in promoting container modules vs simple containers in Nextflow...
...and then, here comes into play the feature you were mentioning, that is, the ability to automatically install missing container modules!

Again, I can think of at least two slightly distinct usage scenarios:

  1. centralised container modules by sys admins: here Nextflow already works, as this demo pipeline shows
  2. container modules by users: here we probably need to code a new feature in Nextflow, but there are at least two advantages compared to how Nextflow currently works:
    I. avoiding caching the same container in multiple places, storing it once as a container module
    II. once Nextflow installs the container module, it also becomes available to the user for standard, non-Nextflow usage via the shell

Now, slightly repeating my comment above, with more context.
The feature could be something like modules.auto_install_shpc, and it could be disabled or enabled.
When enabled, it comes into action when a module is requested but not found (copy-pasting from above):
a. look for SHPC availability in the shell environment
b. attempt shpc install of the requested module
c. attempt the module load again
d. if any of these fail, exit with an appropriate error message.
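As a rough shell sketch of that fallback (the option name and the exact commands are of course just my assumption of how Nextflow could do it; $MOD stands for the requested module name):

# pseudo-logic run when modules.auto_install_shpc is enabled and a module is missing
if ! module load "$MOD" 2>/dev/null; then
    command -v shpc >/dev/null 2>&1 || { echo "shpc not found in the environment"; exit 1; }
    shpc install "$MOD" || { echo "shpc install failed for $MOD"; exit 1; }
    module load "$MOD" || { echo "module $MOD still not found, check MODULEPATH"; exit 1; }
fi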

One thing that might be needed on the SHPC side is appropriate exit codes/error messages, depending on the way SHPC can fail, including:

  • requested module not found
  • requested container could not be pulled
  • requested modulefile could not be created
  • shpc executable failing for a generic reason (e.g. missing python dependency)
  • others...?

Then, Nextflow can pick up these exit codes and print appropriate error messages (or, it can just lazily print whatever error message SHPC produces).
Other errors to catch for Nextflow (not SHPC) include:

  • shpc executable not found
  • module not found (prompting to double check MODULEPATH vs SHPC module installation path)
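Just to sketch how Nextflow could consume those exit codes (the numeric values below are invented for illustration, they are not something SHPC defines today):

shpc install "$CONTAINER"
case $? in
    0) ;;                                                        # installed fine, carry on
    2) echo "requested module not found in the SHPC registry" ;;
    3) echo "requested container could not be pulled" ;;
    4) echo "modulefile could not be created" ;;
    *) echo "shpc failed for a generic reason (e.g. missing python dependency)" ;;
esac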

And as I said, I am happy to start pinging the Nextflow devs by showing them this demo pipeline, and the beauty of being able to provide software in a reproducible way in as many as 3 different ways, including the latest and greatest...SHPC!

Well, that's really all from me for today... hope this can prompt further conversation!

@marcodelapierre
Owner Author

@vsoch @pditommaso have a look at my thoughts in the last point above, on possible further interaction between Nextflow and Singularity-HPC (SHPC)

@vsoch
Contributor

vsoch commented Feb 14, 2022

It's really fun to think about these ideas! Here are some thoughts.

> advantage of SHPC, but there's actually at least another equally cool one, which is structuring and tidying up containers deployments. What do I mean by this? From direct experience, when using Singularity you often end up with unwanted duplicates of the same SIF image, spread across the filesystem: the same user might download the same images during different workflows, and on a shared system distinct users may be using the very same container without knowing.

This has been an annoyance of mine as well - I tried to tackle it first with sregistry-cli (that managed a little database of user containers) but it never became popular, probably because people don't care that much to go out of their way to organize. My hope for shpc, along with unifying the interfaces of modules and containers, was also to help with that. I'm not sure a user would go out of their way to use it, but a team of sysadmins maybe? And to be specific, Singularity already handles a cache directory on behalf of a user / group. However if you try to pull/interact with something from the cache it does redundantly copy it, which is what we'd want to avoid.

> Well, Nextflow already takes good care of usage barrier/user experience for containers, by abstracting away quite a few aspects of their management (similar to my first point on SHPC above). However, it does not completely solve the second aspect, of having clean, rationalised container installations.

Even the concept of "container installations" I like a lot, but I think it can rub people that work on package managers the wrong way to suggest that containers are akin to packages, so therefore shpc is a form of package manager or installer. But I do wonder how we could do this better. E.g., right now we rely on pre-defined recipes, and they are only updated once a month for my own reviewer sanity. One thing I've been thinking about is if we could expose the updater bot, binoc, as an API endpoint so a user could request an update on demand (or for example, run a nightly job). cc-ing @alecbcs on this idea. It's a different idea than what you are discussing in this thread so I won't delve into detail, but basically if we had binoc exposed as an API I think we could update on the fly (without the GitHub action that uses go) and probably with a little bit of implementation here (or if binoc had an API) we could even make module files on the fly. I mention this because I think the likelihood of a user developing a container and submitting it to shpc is small, but perhaps if you could do it on the fly given some URI on a registry it would be easy to do. E.g.,

$ shpc create oras://ghcr.io/singularityhub/github-ci
$ shpc create docker://ghcr.io/buildsi/smeagle

And avoid the annoying work of needing to look up hashes, etc. Okay sorry distracted, back to Nextflow!

> Again, I can think of at least two slightly distinct usage scenarios:
> 1. centralised container modules by sys admins: here Nextflow already works, as this demo pipeline shows
> 2. container modules by users: here we probably need to code a new feature in Nextflow, but here there are at least two advantages, compared to how Nextflow currently works:

> I. avoiding caching the same container in multiple places, storing it once as a container module

I'm not super familiar with Nextflow (I've used it a handful of times but never developed for it) - does it indeed pull / obtain a container more than once to run something (e.g., from the cache to some PWD for the workflow)?

> II. once Nextflow installs the container module, it also becomes available to the user for standard, non-Nextflow usage via the shell

Oh that's a neat idea - so it sounds like we would allow the Nextflow user to have some base of shpc modules, and then install to it and use as necessary without the potential redundancy. My one concern here would be if someone grabbed their workflow to run again, given shpc modules aren't available, how would that work?

And to zoom out a bit - imagine if we could design shpc to have basic hooks to allow any kind of workflow manager to use it as a command base? E.g., we define a root, point the workflow manager to it, and then we'd need hooks for checking if a module exists, installing if not, and properly returning a nice set of error codes and messages, etc. I wonder, if we have an shpc create and some context of a command found in the container, whether we could even have that run on the fly to create new modules?
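To make that a bit more concrete, the hooks could look something like this (the subcommand names here are purely hypothetical, I'm just sketching the shape of the interface):

# invented hook interface a workflow manager could call (only 'shpc install' exists today)
shpc exists quay.io/biocontainers/samtools:1.15      # hypothetical: exit 0 if the module is already installed
shpc install quay.io/biocontainers/samtools:1.15     # install into the configured module root
shpc module-dir                                      # hypothetical: print the module root to pass to 'module use'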

> d. if any of these fail, exit with an appropriate error message.

Ah okay I see - so no shpc == error, which I don't think is ideal. Maybe it could just fall back to pulling the singularity container as before?

> One thing that might be needed on the SHPC side is appropriate exit codes/error messages, depending on the way SHPC can fail, including

I can totally add these!

> to provide software in a reproducible way in as many as 3 different ways, including the latest and greatest...SHPC!

Haha I love your enthusiasm! I've also loved working on shpc, it's always the "personal projects not during work time" projects that are the best.

> Well, that's really all from me for today... hope this can prompt further conversation!

Yes! Thank you @marcodelapierre !

@marcodelapierre
Owner Author

marcodelapierre commented Apr 1, 2022

Hi @vsoch ,
so I had some extra thoughts in this space, regarding having Nextflow shpc install stuff.

After the interesting journey of the symlink_tree, I now have a better picture.
So, we were toying with the idea of a Nextflow feature modules.auto_install_shpc. When enabled, Nextflow would try to shpc install missing modules.

Now, the caveat I was thinking of at the beginning was indeed when people make use of symlinks for the container modules.
E.g. suppose you ask for quay.io/biocontainers/samtools/1.0, but what you get, or would get from SHPC as a findable module, is instead samtools/1.0.
Well, now that you and Matthieu have worked on a specification on how to create the shorter symlinked tree...that's the logic we might seek to implement into Nextflow.

Specifically, we could have multiple values for modules.auto_install_shpc in Nextflow: off, on and symlink.
When on, Nextflow would trigger shpc install if the module is not available. At execution, Nextflow would look for the literal module name.
When symlink, the main difference is that at execution Nextflow would look for the literal, full name and, if not found, would in addition apply the same SHPC name-shortening convention as symlink_tree, to look for the shortened name.
Note these aspects would apply to point "d." in my original first message in this issue.
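A minimal sketch of that lookup order in symlink mode (reusing the samtools example above, and assuming the shortening follows the symlink_tree convention):

# try the literal full name, then the shortened symlink name, then install and retry
module load quay.io/biocontainers/samtools/1.0 2>/dev/null \
  || module load samtools/1.0 2>/dev/null \
  || { shpc install quay.io/biocontainers/samtools:1.0 && module load samtools/1.0; }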

I am starting to write these to get your opinion.
Then, I am sure that Paolo, the Nextflow lead, might have further insights, and ideas on how to fine tune this.
However, I would probably wait to involve him until the symlink_tree feature is merged back into the main branch of SHPC.

@vsoch
Contributor

vsoch commented Apr 2, 2022

@marcodelapierre picking up here! So I've been thinking about this more general idea of a custom module environment (e.g., here singularityhub/singularity-hpc#539 (comment)) and arguably could shpc support more than one symlink directory? Or more than one install directory? Minimally for how it is now and with nextflow, imagine if we could basically create a module group / environment (not sure what to call it) and this would be what we have now for a symlink tree, e.g.,

shpc new-env ./nextflow-test

And then we add stuff to it? This is similar to how spack does it. For shpc, however, it's just symlinks.

# one of these to say "make a new symlink base here"
shpc add-env ./nextflow-test python:3.9.5-alpine

And then the symlink would be created there, and if it's not installed in the main tree it would be. And then we would just run shpc commands providing that module base / symlink tree to nextflow to load and use. What I don't have a good idea for yet is how we would keep track of these little environments. E.g., we could never be sure where they are, and, if say a module is deleted from the main install, whether the symlink environments could be updated accordingly. Right now we are only allowing a single symlink tree, so it's possible to check.

Ohh an idea! Maybe what we want is to allow people to make named environments (like conda) and then store them in a place we can always write and update? E.g.,

~/.singularity-hpc
   /envs/
     nextflow-test/
     river/
     dinosaur/

And then you would do:

shpc new-env nextflow-test

And then we add stuff to it? This is similar to how spack does it. For shpc, however, it's just symlinks.

shpc add-env nextflow-test python:3.9.5-alpine

And then on any uninstall we could always check / update there, because the top level is the directory names. And for module load, perhaps we would need a helper command to get the full path. E.g.,

$ shpc load nextflow-test

The issue there, of course, is that the user typically needs to run module load in their shell directly, so it might be more like:

$ shpc env nextflow-test
/home/dinosaur/.singularity-hpc/envs/nextflow-test
$ module load $(shpc env nextflow-test)

Just brainstorming! I like the idea that we can solve this "needing different combinations of things" issue with different named environments, and maybe that is the best idea for the symlink tree - not "one hard coded one" but as many smaller, named ones as you like.

@muffato

muffato commented Apr 4, 2022

Hi both! We're also going to use Nextflow for our pipelines :) so I'm interested in seeing where this goes, even though, TBH, I'm not yet convinced by the shpc mode for two reasons.

  • sharing containers (.sif files) can already be achieved by setting the $NXF_SINGULARITY_CACHEDIR variable, or the equivalent option in a Nextflow config file (see the example right after this list)
  • as far as I know, shpc add can generate a container.yaml on the fly but won't list any executables. Did any of you propose a solution for that? Quite a long-shot, but perhaps shpc add could pre-populate the list with the content of /usr/local/bin?
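On the first point, that existing sharing mechanism is simply (the path here is just an example):

# shared image cache across pipelines and users, no shpc involved
export NXF_SINGULARITY_CACHEDIR=/shared/containers/nxf-cache
nextflow run main.nf -profile singularity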

That said, I totally get why one would like to have modules matching the software dependencies of a Nextflow pipeline, for development, debugging, etc. I want to provide such modules too, and then list them as module = in the pipelines alongside the container = and conda = directives, but I didn't imagine Nextflow would need to know shpc is involved.

To me, it's displacing the singularity run but essentially doing the same thing

  1. With a regular singularity setup, Nextflow works like this:

.command.run: singularity run → [within the container: .command.sh → software commands]

  2. With shpc modules, it would be like this

.command.run → .command.sh → shpc software wrappers: singularity run → [within the container: software commands]

So, maybe the shpc/Nextflow interaction should rather be considered in the other direction? When developing a new Nextflow pipeline:

  1. someone would first request their software to be installed as modules through shpc
  2. Then they test the commands in a terminal with the module loaded
  3. Then they wrap the commands in a Nextflow .nf file without any conda, container or module directive and test it
  4. Then they add these directives and test the .nf file in a separate terminal where the module is not loaded

To me the main issue is that shpc requires a container.yaml to define the executables that are exposed. This means that the process of preparing shpc for a given Nextflow pipeline will be somewhat manual. But there is some hope!

First of all, collecting the containers can be automated. I found this snippet script on the nf-core Slack the other day https://nfcore.slack.com/archives/CE5LG7WMB/p1644235122771249 . The only issue is that the containers it lists may be pre-prepared Singularity images, not Docker images, thus losing the usual path-like name shpc expects, cf:

$ python list_singularity.py 
Specify the name of a nf-core pipeline or a GitHub repository name (user/repo).
? Pipeline name: sanger-tol/readmapping
? Select release / branch: main  [branch]
https://containers.biocontainers.pro/s3/SingImgsRepo/biocontainers/v1.2.0_cv1/biocontainers_v1.2.0_cv1.img
https://depot.galaxyproject.org/singularity/bam2fastx:1.3.1--hb7da652_2
https://depot.galaxyproject.org/singularity/bwa-mem2:2.2.1--he513fc3_0
https://depot.galaxyproject.org/singularity/minimap2:2.21--h5bf99c6_0
https://depot.galaxyproject.org/singularity/mulled-v2-e5d375990341c5aef3c9aff74f96f66f65375ef6:8ee25ae85d7a2bacac3e3139db209aff3d605a18-0
https://depot.galaxyproject.org/singularity/multiqc:1.11--pyhdfd78af_0
https://depot.galaxyproject.org/singularity/multiqc:1.12--pyhdfd78af_0
https://depot.galaxyproject.org/singularity/python:3.9--1
https://depot.galaxyproject.org/singularity/samtools:1.15--h1170115_1

Never mind, the script is essentially just an elaborate grep, so I could modify it to print the containers:

biocontainers/biocontainers:v1.2.0_cv1
quay.io/biocontainers/bam2fastx:1.3.1--hb7da652_2
quay.io/biocontainers/bwa-mem2:2.2.1--he513fc3_0
quay.io/biocontainers/minimap2:2.21--h5bf99c6_0
quay.io/biocontainers/mulled-v2-e5d375990341c5aef3c9aff74f96f66f65375ef6:8ee25ae85d7a2bacac3e3139db209aff3d605a18-0
quay.io/biocontainers/multiqc:1.11--pyhdfd78af_0
quay.io/biocontainers/multiqc:1.12--pyhdfd78af_0
quay.io/biocontainers/python:3.8.3
quay.io/biocontainers/samtools:1.15--h1170115_1
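The core of the extraction is just something along these lines (a rough sketch, not the actual script, assuming a local checkout of the pipeline with the usual nf-core modules/ layout):

# list Docker-style biocontainers URIs referenced by a pipeline checkout
grep -rhoE 'quay\.io/biocontainers/[A-Za-z0-9_.-]+:[A-Za-z0-9_.-]+' modules/ | sort -u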

I can put this script in a Gist if you want.

Finally, about the pre-computed environments. Yes, in an HPC setup, one could decide to organise the modules in different groups, one per pipeline, in order to keep it tidy. But I would disconnect that feature from the shpc/Nextflow integration. We don't need to create that dependency.

@vsoch
Contributor

vsoch commented Apr 4, 2022

I'll let @marcodelapierre comment because I'm not actively using Nextflow (but am here to support the effort and ideas!)

> sharing containers (.sif files) can already be achieved by setting the $NXF_SINGULARITY_CACHEDIR variable, or the equivalent option in a Nextflow config file

I think the idea would be that shpc adds the set of commands that come with the modules.

> as far as I know, shpc add can generate a container.yaml on the fly but won't list any executables.

Correct! The executables are kind of manually added - unless you have a container base with a consistent install location (which actually we do for autamus, it's always at /opt/view) you'd have to custom write the executable paths.

> Did any of you propose a solution for that? Quite a long-shot, but perhaps shpc add could pre-populate the list with the content of /usr/local/bin?

I would say we need to move up one level and have a more formal definition of the container structure. Another idea is to use a tool like tern https://github.com/tern-tools/tern/ which is intended to scan containers to generate SBOMs, but instead use that same logic to find packages installed by known package managers. Of course for custom or special software, you would miss that. A third idea is to get really messy and look into the docker history and see if you can derive "important locations" for things, and then look for binaries there.

> That said, I totally get why one would like to have modules matching the software dependencies of a Nextflow pipeline, for development, debugging, etc. I want to provide such modules too, and then list them as module = in the pipelines alongside the container = and conda = directives, but I didn't imagine Nextflow would need to know shpc is involved.

Agree, and I think we would want to be able to use shpc for "any kind of workflow thing you can run on shpc that could use a throwaway environment".

> To me, it's displacing the singularity run but essentially doing the same thing

Agree - it's just another means to provide an executable, but maybe one step easier only given that the container.yaml is already in the registry, and it has the binaries we need from our pipeline. A "manual" solution would be to look at something like snakemake wrappers and just make sure we have a container to match the most popular.

I can't comment because I'm not super experienced with Nextflow, but the proposed interaction is the one @marcodelapierre summarized here: https://github.com/marcodelapierre/demo-shpc-nf#4-run-with-singularity-hpc-shpc

and:

$ nextflow run main.nf -profile shpc

> To me the main issue is that shpc requires a container.yaml to define the executables that are exposed. This means that the process of preparing shpc for a given Nextflow pipeline will be somewhat manual. But there is some hope!

I think if the modules are provided, loaded, it would work, but arguably we could also let shpc create a view and install to it on behalf of the user (this is the idea I had in mind with views - quick "throwaway" environments).

> Finally, about the pre-computed environments. Yes, in an HPC setup, one could decide to organise the modules in different groups, one per pipeline, in order to keep it tidy. But I would disconnect that feature from the shpc/Nextflow integration. We don't need to create that dependency.

Right! So they are different use cases, but I think they could use the same underlying feature. A view could be quick "create, use and throw away" or "create and use for a group of modules, don't throw away!"
