Ideas around automating container module installation by Nextflow #2
Another thing I want to do is to ping them to show them the demo pipeline! ... but not now, my brain is completely boiled! Be back on Monday |
Adding some more reflections. First: forget about my "digression" above on the "namespace" trick, "shpc show" and so on - I think I was going off track. So... I really think yours is a great idea, worth developing and bringing to the attention of Nextflow. Some context for my thinking. I always mention
Why am I telling all this? That can be a good advantage in promoting container modules vs simple containers in Nextflow... Again, I can think of at least two slightly distinct usage scenarios:
Now, slightly repeating my comment above, with more context. One thing that might be needed on the SHPC side is appropriate exit codes/error messages, depending on the ways SHPC can fail, including
Then, Nextflow can pick up these exit codes and print appropriate error messages (or, it can just lazily print whatever error message SHPC produces).
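To make that concrete, here is a minimal sketch of how a caller could map such exit codes to messages. The specific numeric codes used below are assumptions for illustration only - they are not documented shpc exit codes.

```shell
# Hypothetical mapping of shpc failure modes to messages a workflow
# manager could surface. The numeric codes are assumptions, not
# actual shpc behaviour.
explain_shpc_status() {
  case "$1" in
    0) echo "ok" ;;
    1) echo "shpc not found in the environment" ;;
    2) echo "no shpc recipe exists for this container" ;;
    3) echo "shpc install failed" ;;
    *) echo "unexpected shpc error (code $1)" ;;
  esac
}
```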
And as I said, I am happy to start pinging the Nextflow devs by showing them this demo pipeline, and the beauty of being able to provide software in a reproducible way in as many as 3 different ways, including the latest and greatest...SHPC! Well, that's really all from me for today... hope this can prompt further conversation! |
@vsoch @pditommaso have a look at my thoughts in the last point above, on possible further interaction between Nextflow and Singularity-HPC (SHPC) |
It's really fun to think about these ideas! Here are some thoughts.
This has been an annoyance of mine as well - I tried to tackle it first with sregistry-cli (which managed a little database of user containers), but it never became popular, probably because people don't care enough to go out of their way to organize. My hope for shpc, along with unifying the interfaces of modules and containers, was also to help with that. I'm not sure a user would go out of their way to use it, but a team of sysadmins maybe? And to be specific, Singularity already handles a cache directory on behalf of a user / group. However, if you try to pull/interact with something from the cache it does redundantly copy it, which is what we'd want to avoid.
Even the concept of "container installations" I like a lot, but I think it can rub people who work on package managers the wrong way to suggest that containers are akin to packages, and therefore that shpc is a form of package manager or installer. But I do wonder how we could do this better. E.g., right now we rely on pre-defined recipes, and they are only updated once a month for my own reviewer sanity. One thing I've been thinking about is whether we could expose the updater bot, binoc, as an API endpoint so a user could request an update on demand (or, for example, run a nightly job). cc-ing @alecbcs on this idea. It's a different idea than what you are discussing in this thread so I won't delve into detail, but basically if we had binoc exposed as an API I think we could update on the fly (without the GitHub action that uses go), and probably with a little bit of implementation here (or if binoc had an API) we could even make module files on the fly. I mention this because I think the likelihood of a user developing a container and submitting it to shpc is small, but perhaps if you could do it on the fly, given some URI on a registry, it would be easy to do. E.g.:

```
$ shpc create oras://ghcr.io/singularityhub/github-ci
$ shpc create docker://ghcr.io/buildsi/smeagle
```

And avoid the annoying work of needing to look up hashes, etc. Okay, sorry, distracted - back to Nextflow!
I'm not super familiar with Nextflow (I've used it a handful of times but never developed for it) - does it indeed pull / obtain a container more than once to run something (e.g., from the cache to some PWD for the workflow)?
Oh that's a neat idea - so it sounds like we would allow the Nextflow user to have some base of shpc modules, and then install to it and use as necessary without the potential redundancy. My one concern here would be if someone grabbed their workflow to run again, given shpc modules aren't available, how would that work? And to zoom out a bit - imagine if we could design shpc to have basic hooks to allow any kind of workflow manager to use it as command bases? E.g., we define a root, point the workflow manager to it, and then we'd need hooks for checking if a module exists, installing if not, and properly returning a nice set of error codes and messages, etc. I wonder if we have an
Ah okay I see - so no shpc == error, which I don't think is ideal. Maybe it could just fall back to pulling the singularity container as before?
I can totally add these!
Haha I love your enthusiasm! I've also loved working on shpc, it's always the "personal projects not during work time" projects that are the best.
Yes! Thank you @marcodelapierre ! |
Hi @vsoch , After the interesting journey of the Now, the caveat I was thinking of at the beginning was indeed when people make use of symlinks for the container modules. Specifically, we could have multiple values for I am starting to write these to get your opinion. |
@marcodelapierre picking up here! So I've been thinking about this more general idea of a custom module environment (e.g., here singularityhub/singularity-hpc#539 (comment)), and arguably could shpc support more than one symlink directory? Or more than one install directory? Minimally for how it is now and with nextflow, imagine if we could basically create a module group / environment (not sure what to call it), and this would be what we have now for a symlink tree, e.g.:

```
# one of these to say "make a new symlink base here"
shpc new-env ./nextflow-test
```

And then we add stuff to it? This is similar to how spack does it. For shpc, however, it's just symlinks:

```
shpc add-env ./nextflow-test python:3.9.5-alpine
```

And then the symlink would be created there, and if it's not installed in the main tree it would be. And then we would just run shpc commands providing that module base / symlink tree to nextflow to load and use.

What I don't have a good idea for yet is how we would keep track of these little environments. E.g., we could never be sure where they are, or whether, say, if a module is deleted from the main install, it could update the symlink environment too. Right now we are only allowing a single symlink tree, so it's possible to check.

Ohh an idea! Maybe what we want is to allow people to make named environments (like conda) and then store them in a place we can always write and update? E.g.:

```
~/.singularity-hpc/
    envs/
        nextflow-test/
        river/
        dinosaur/
```

And then you would do:

```
shpc new-env nextflow-test
shpc add-env nextflow-test python:3.9.5-alpine
```

And then any uninstall we could always check / update there, because the top level is the directory names. And for module load:

```
$ shpc load nextflow-test
```

The issue there, of course, is that the user typically needs to run module load in their shell directly, so it might be more like:

```
$ shpc env nextflow-test
/home/dinosaur/.singularity-hpc/envs/nextflow-test

module load $(shpc env nextflow-test)
```

Just brainstorming! I like the idea that we can solve this "needing different combinations of things" issue with different named environments, and maybe that is the best idea for the symlink tree - not one hard-coded one, but as many smaller, named ones as you like. |
Hi both! We're also going to use Nextflow for our pipelines :) so I'm interested in seeing where this goes, even though, TBH, I'm not yet convinced by the
That said, I totally get why one would like to have modules matching the software dependencies of a Nextflow pipeline, for development, debugging, etc. I want to provide such modules too, and then list them as To me, it's displacing the
So, maybe the shpc/Nextflow interaction should rather be considered in the other direction? When developing a new Nextflow pipeline:
To me the main issue is that First of all, collecting the containers can be automated. I found this snippet script on the nf-core Slack the other day https://nfcore.slack.com/archives/CE5LG7WMB/p1644235122771249 . Only issue is that the containers it lists may be pre-prepared Singularity images, not Docker images, thus losing the usual path-like name shpc expects, cf:
Never mind, the script is essentially just an elaborated
I can put this script in a Gist if you want. Finally, about the pre-computed environments. Yes, in an HPC setup, one could decide to organise the modules in different groups, one per pipeline, in order to keep it tidy. But I would disconnect that feature from the shpc/Nextflow integration. We don't need to create that dependency. |
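For reference, the simplest form of that collection step can be sketched as a one-liner. This is my own sketch, not the Slack script: it assumes processes declare containers with the plain `container '<uri>'` directive in `.nf` files, and it will miss URIs set dynamically or in config profiles.

```shell
# Sketch: list the unique container URIs declared in a pipeline's .nf
# files. Assumes the literal `container '<uri>'` directive style;
# dynamic or config-based container settings are not caught.
grep -rhoE "container ['\"][^'\"]+['\"]" . --include='*.nf' \
  | sed -E "s/container ['\"]([^'\"]+)['\"]/\1/" \
  | sort -u
```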
I'll let @marcodelapierre comment because I'm not actively using Nextflow (but am here to support the effort and ideas!)
I think the idea would be that shpc adds the set of commands that come with the modules.
Correct! The executables are kind of manually added - unless you have a container base with a consistent install location (which actually we do for autamus, it's always at /opt/view) you'd have to custom write the executable paths.
I would say we need to move up one level and have a more formal definition of the container structure. Another idea is to use a tool like tern https://github.com/tern-tools/tern/ which is intended to scan containers to generate SBOMs, but instead use that same logic to find packages installed by known package managers. Of course for custom or special software, you would miss that. A third idea is to get really messy and look into the docker history and see if you can derive "important locations" for things, and then look for binaries there.
Agree, and I think we would want to be able to use shpc for "any kind of workflow thing you can run on shpc that could use a throwaway environment"
Agree - it's just another means to provide an executable, but maybe one step easier only given that the container.yaml is already in the registry, and it has the binaries we need from our pipeline. A "manual" solution would be to look at something like snakemake wrappers and just make sure we have a container to match the most popular. I can't comment because I'm not super experienced with Nextflow, but the proposed interaction @marcodelapierre summarized here: https://github.com/marcodelapierre/demo-shpc-nf#4-run-with-singularity-hpc-shpc and:

```
$ nextflow run main.nf -profile shpc
```
I think if the modules are provided, loaded, it would work, but arguably we could also let shpc create a view and install to it on behalf of the user (this is the idea I had in mind with views - quick "throwaway" environments).
Right! So they are different use cases, but I think could use the same underlying feature. A view could be quick "create, use and throw away" or "create and use for a group of modules, don't throw away!" |
Expanding on @vsoch's comment in #1:
First of all, thanks for starting this conversation, sounds very interesting to me!
Let me try and add further aspects.
It would be super cool if Nextflow were able to install container modules by itself!
Current behaviour: if a requested module is not found, exit with error.
Desirable, if module not found:
a. look for SHPC availability in the shell environment
b. look for recipe availability of the needed module (shpc show)
c. shpc install the container
d. attempt the module load again
e. if any of these fail, exit with appropriate error message.
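The steps above can be sketched as a shell function. `shpc show` and `shpc install` are real shpc subcommands, but the overall flow, the messages, and the numeric return codes here are purely illustrative - this is a proposal, not existing Nextflow or shpc behaviour.

```shell
# Proposed fallback flow (steps a-e above) as a shell sketch; the flow
# and return codes are illustrative, not implemented behaviour.
load_or_install() {
  mod="$1"
  module load "$mod" 2>/dev/null && return 0             # module already available
  command -v shpc >/dev/null 2>&1 \
    || { echo "error: shpc not found in environment" >&2; return 1; }     # (a)
  shpc show "$mod" >/dev/null 2>&1 \
    || { echo "error: no shpc recipe for $mod" >&2; return 2; }           # (b)
  shpc install "$mod" \
    || { echo "error: shpc install failed for $mod" >&2; return 3; }      # (c)
  module load "$mod" \
    || { echo "error: module load failed after install" >&2; return 4; }  # (d, e)
}
```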
[UPDATE]: ignore the following bit