Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nf-core download <pipeline name> #53

Merged
merged 12 commits into from
Jun 8, 2018
Merged

Conversation

ewels
Copy link
Member

@ewels ewels commented May 30, 2018

New subcommand to download a pipeline and its singularity containers.

  • Validates pipeline input, fetches workflow details and release hash
  • Download workflow files
  • If -sis specified, downloads singularity images
    • Automatically finds container names from workflow config
    • Looks on the singularity hub API for a singularity image - downloads it if found
      • Includes md5 check
    • If not found, tries to pull singularity images using singularity
    • If singularity is not installed, tries using docker (with docker2singularity)
      • removed due to configuration difficulties - may be possible to put back later with more investigation

To do:

  • Check that it works for other people
  • Write tests!

Example usage:

$ nf-core download methylseq -s

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'


INFO: Saving to 'nf-core-methylseq-1.0'

INFO: Downloading workflow files from GitHub

INFO: Fetching container names for workflow

INFO: Building singularity image 'nfcore/methylseq:1.0', saving to /Users/ewels/GitHub/nf-core/testing/nf-core-methylseq-1.0/singularity-images/nf-core-methylseq-1.0.simg

WARNING: Singularity is not installed. Attempting to use Docker instead.
Size: 4720 MB for the singularity container
(1/9) Creating an empty image...
Creating a sparse image with a maximum size of 4720MiB...
Using given image size of 4720
Formatting image (/sbin/mkfs.ext3)
Done. Image can be found at: /tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img
(2/9) Importing filesystem...
(3/9) Bootstrapping...
(4/9) Adding run script...
(5/9) Setting ENV variables...
Singularity: sexec (U=0,P=143)> Command=exec, Container=/tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img, CWD=/, Arg1=/bin/sh
(6/9) Adding mount points...
Singularity: sexec (U=0,P=149)> Command=exec, Container=/tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img, CWD=/, Arg1=/bin/sh
(7/9) Fixing permissions...
Singularity: sexec (U=0,P=155)> Command=exec, Container=/tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img, CWD=/, Arg1=/bin/sh
Singularity: sexec (U=0,P=196)> Command=exec, Container=/tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img, CWD=/, Arg1=/bin/sh
Debian GNU/Linux 9 \n \l
We're not running on BusyBox/Buildroot
Singularity: sexec (U=0,P=202)> Command=exec, Container=/tmp/nfcore_methylseq_1.0-2018-04-17-d18eb7142a18.img, CWD=/, Arg1=/bin/sh
(8/9) Stopping and removing the container...
d18eb7142a18
d18eb7142a18
(9/9) Moving the image to the output folder...
  2,381,807,616  48%   10.97MB/s    0:03:48

Example output:

$ tree -L 3
.
└── nf-core-methylseq-1.0
    ├── singularity-images
    │   └── nf-core-methylseq-1.0.simg
    └── workflow
        ├── CHANGELOG.md
        ├── Dockerfile
        ├── LICENCE.md
        ├── README.md
        ├── assets
        ├── bin
        ├── conf
        ├── docs
        ├── environment.yml
        ├── main.nf
        ├── nextflow.config
        └── tests

9 directories, 7 files

@ewels ewels added enhancement WIP Work in progress labels May 30, 2018
@ewels ewels requested a review from a team May 30, 2018 23:24
@codecov-io
Copy link

codecov-io commented May 30, 2018

Codecov Report

Merging #53 into master will decrease coverage by 22.16%.
The diff coverage is 14.28%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #53       +/-   ##
===========================================
- Coverage    94.8%   72.63%   -22.17%     
===========================================
  Files           4        6        +2     
  Lines         500      665      +165     
===========================================
+ Hits          474      483        +9     
- Misses         26      182      +156
Impacted Files Coverage Δ
nf_core/download.py 0% <0%> (ø)
nf_core/utils.py 100% <100%> (ø)
nf_core/lint.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 360e41a...0278018. Read the comment docs.

* Validates pipeline input, fetches workflow details and release hash
* Download workflow files
* If -s specified, automatically finds container names from workflow config
* Tries to pull singularity images using first singularity, then docker
scripts/nf-core Outdated
help = "Pipeline release"
)
@click.option(
'-s/-n', '--singularity/--no-singularity',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to have something like -s yes / -s no instead?
I find it disturbing to have two options for two opposites things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess - the above is the standard way to handle boolean flags with click (see docs), so this is the way I've done it in the past. But I don't really mind either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I didn't knew the standards.
I'm fine with it then ;-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I did it as a single flag first - just -s to get singularity images and nothing to not get them. But then I thought it was better to be explicit and require this input from the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, after playing a bit more I agree. As it's required the help message says that it's missing -s if not supplied, which is kind of counter-intuitive.

Just need to find a nice way to handle y|Y|yes|Yes|YES for both options now. I bet click has a nice tidy way to handle this for you, if I can just find it..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too would prefer just one parameter -s or --singularity-images with yes/no as only accepted values, that can be done by using choices or giving a call_back.

Or just --singularity/--no-singularity (without -s/-n)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - I half implemented the Choice thing this morning, but if we want to handle case insensitivity and short/long form then the list of allowed options gets messy... 🤓 (y|Y|yes|Yes|YES|n|N|no|No|NO).

We could write a custom callback or string handling behaviour, but that feels like overkill and the help text is not particularly pretty then. click already has this kind of handling for the --yes behaviour. click.BOOL can also handle string values, so I figure that click probably has built-in support for this somehow...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah!! looks like setting type=click.BOOL does the trick right ? for y/n values

Option:

@click.option('-s', '--singularity', type=click.BOOL, default=None, required=True)

Test:

$ python cli_test.py test -s y
True
$ python cli_test.py test -s yes
True
$ python cli_test.py test -s YEs
True
$ python cli_test.py test -s YeS
True
$ python cli_test.py test -s nO
False
$ python cli_test.py test -s n
False
$ python cli_test.py test -s NO
False
$ python cli_test.py test -s NOo
Usage: cli_test.py test [OPTIONS]

Error: Invalid value for "-s" / "--singularity": noo is not a valid boolean
$ python cli_test.py test -s 0
False
$ python cli_test.py test -s 1
True
 $ python cli_test.py test -s 2
Usage: cli_test.py test [OPTIONS]

Error: Invalid value for "-s" / "--singularity": 2 is not a valid boolean

scripts/nf-core Outdated
help = "Pipeline release"
)
@click.option(
'-s/-n', '--singularity/--no-singularity',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I too would prefer just one parameter -s or --singularity-images with yes/no as only accepted values, that can be done by using choices or giving a call_back.

Or just --singularity/--no-singularity (without -s/-n)


# If we got this far, must not be a nf-core pipeline
if self.pipeline.count('/') == 1:
# Looks like a GitHub address - try working with this repo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check this upstream ? i.e. validate the param value at very beginning. That could improve the success rate of downloading something even higher 😉

Also could be good to log that it is not a nf-core pipeline (?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, should do both 👍 Was rushing to push the code so that I wasn't late into work for Remi's talk 😉

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as early as I can check I think. In order to validate the pipeline name, I have to fetch the pipeline names from the website API and see if the supplied name matches any of them (the above for wf in wfs.remote_workflows loop). So this point in the code is the first time that I know it's not an nf-core pipeline.

Agree that logging is a good idea 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging added in fd90e46

# Check that the outdir doesn't already exist
if os.path.exists(wf.outdir):
logging.error("Output directory '{}' already exists".format(wf.outdir))
sys.exit(1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really wanna exit ? If so why not have this check at the beginning (like as soon as receiving the parameter or in the class) that will save us all the trouble of doing everything 😇

Or just warn and save in a custom name so they can rename the folder later ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should probably handle this in a nicer way 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I remember..

If so why not have this check at the beginning

I can't check any earlier, as if --outdir is not specified, the default output directory name is generated based on the workflow name and version number. The version number is the last release if not specified. So basically, I don't know what the output directory name will be before I get the workflow details.

Or just warn and save in a custom name so they can rename the folder later?

Nah - this command isn't going to be super common to run (like multiqc or nextflow or something), so it's not such a big deal to just bail and exit I think. Less code and potentially less confusing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see the point now 👍


import nf_core.list, nf_core.lint

def download_workflow(pipeline, release=None, singularity=False, outdir=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, Why make it as independent function rather than not including it in the Class below ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't remember, was copying my earlier code structure for the other modules.

* Moved the nextflow config parsing into new common utils.py module
* Made singularity download option a single command line boolean flag
* Improved logging for download script
@ewels
Copy link
Member Author

ewels commented May 31, 2018

Now waiting for singularityhub/docker2singularity#30 to be merged so that we can create newer better singularity images using docker ✨

logging.info("Pipeline is in development. Using current code on master branch.")

# Set outdir name if not defined
if self.outdir is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this block can be moved up to L40 (keeping all outdir related stuff together), so this function will be clean and does only what the name suggests ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're getting at, but I respectfully disagree 😆 I'm just setting all of the different variables in this function that will be used later on, including the outdir and the download link.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, it is a matter of personnel choice/prefference, so no right or wrong here 😇It makes sense in this context to update necessary variables (including outdir) in this function as this function was called by download_workflow (which uses outdir later on).

I suggested this considering more general context i.e. when (or if) fetch_workflow_details is called by any other function or script that don't have or need outdir variable, then updating outdir would be "out of context" 😜(Not an error but its just something not necessary then). That is why I would have put this part up directly in download_workflow 😉(Of course, since the class itself is more "download" related this might not happen, but like I said it is matter of personnel preference)

self.download_singularity_image(container)


def get_workflow(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to be picky 😉 -> is it really a "get" function ? I mean it returns only True/False. So maybe naming set_workflow or collect_workflow or something else be more appropriate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'll give you this one ;)

Copy link
Member

@sven1103 sven1103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor stuff

logging.warn("Pipeline name doesn't match any nf-core workflows")
logging.info("Pipeline name looks like a GitHub address - attempting to download anyway")
self.wf_name = self.pipeline
if self.release is None:
Copy link
Member

@sven1103 sven1103 May 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better if not self.release

if self.release is None:
self.release = 'master'
self.wf_sha = self.release
if self.outdir is None:
Copy link
Member

@sven1103 sven1103 May 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better if not self.outdir

self.outdir += '-{}'.format(self.release)
# Set the download URL and return
self.wf_download_url = 'https://github.com/{}/archive/{}.zip'.format(self.pipeline, self.release)
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of boolean return values in such functions, that are not a question, like is_finished or so. Better raise a meaningful expection like raise LookupError("Pipeline {} not found on whatever".format(pipeline))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Yes, I felt a bit dirty writing this but it worked. Now you've suggested the raise it seems obvious 😆

logging.error("Not able to find pipeline '{}'".format(self.pipeline))
logging.info("Available pipelines: {}".format(', '.join([w.name for w in wfs.remote_workflows])))
return False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise Exception here

@sven1103 sven1103 mentioned this pull request May 31, 2018
* if not x: instead of if x is None:
* Raise and catch exceptions instead of returning booleans
@ewels
Copy link
Member Author

ewels commented Jun 5, 2018

Hacky in-progress WIP code pushed sorry. Trying to download directly from singularity hub - I feel like it should be easy (curl works fine), but the requests.get line hangs for some reason. Haven't figured out why yet.

@sven1103
Copy link
Member

sven1103 commented Jun 5, 2018

With "hang" you mean timeout? Or just forever?

@ewels
Copy link
Member Author

ewels commented Jun 5, 2018

Just forever. Timeout param has no effect. A very fast minimal example in a live python environment seemed fine, so I’m not sure what’s going on. Ran out of time and had to run into work for a meeting though..

with open(out_path, 'wb') as f:
for chunk in dl_request.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do f.flush() too :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intentional, due to this stack overflow comment

@ewels ewels removed the WIP Work in progress label Jun 5, 2018
@ewels
Copy link
Member Author

ewels commented Jun 5, 2018

Missing tests, but done for now unless anyone has any more suggestions I think!

@sven1103
Copy link
Member

sven1103 commented Jun 6, 2018

-s, --singularity [y/n] Download singularity containers? [required]
I am not so sure about that. An option is an option, and never required for a CLI to work. Otherwise it is a positional parameter. Just make it as boolean flag with default value False. Done

@sven1103
Copy link
Member

sven1103 commented Jun 6, 2018

Next question: if the --singularity option is n, it is not pulling the container from DockerHub. Why? :D

@ewels
Copy link
Member Author

ewels commented Jun 6, 2018

Just make it as boolean flag with default value False. Done

This is what I did at the start of the pull request, but then I changed it at the request of the reviewers 😅 (sort of, anyway. See discussion)

Also note that I found having it as a flag with default False was quite annoying as I kept forgetting to specify it. For this reason I preferred having it as a required option.

Next question: if the --singularity option is n, it is not pulling the container from DockerHub. Why? :D

That's what the flag is for :) Not everyone uses singularity, but anyone running offline will need to download the pipeline files. So these should always be downloaded, but the singularity image should be optional. One could think of other optional downloads in the future too - reference genome files for example.

Downloading from singularity hub / docker hub shouldn't really matter. shub is preferable as it doesn't require singularity to be installed. There is no method to affect this through the command line, and I'm not really sure that there needs to be.


Basically, this was the logic I used to get to this interface:

  1. People shouldn't have to download singularity images if they don't need them
  2. Explicit choice of this is easier and will probably help to avoid unnecessary confusion / annoyance
  3. Positional Boolean options don't really make sense (I don't think I've ever seen them anyway). Personally I prefer a required flag.

Happy to change this if anyone has a better solution though!

@ewels ewels requested a review from a team June 6, 2018 21:25
@sven1103
Copy link
Member

sven1103 commented Jun 7, 2018

This is what I did at the start of the pull request, but then I changed it at the request of the reviewers sweat_smile (sort of, anyway. See discussion)

I am sorry, I really missed that.

Also note that I found having it as a flag with default False was quite annoying as I kept forgetting to specify it. For this reason I preferred having it as a required option.

Don't want to be picky, but we should stick to common Unix-like CLI. A required option is not an option. If you want to download the singularity image, provide the --singularity flag, else don't. If one wants to see the options, type -h/--help as for every UNIX CLI.

--singularity Pulls a singularity image with the software tools for the pipeline.

With the rest I totally agree, and I like the feature very much :)

So, now beat me :P

@apeltzer
Copy link
Member

apeltzer commented Jun 7, 2018

This can be basically seen from two points of view: Do we want people to be able to just run the command and then everything they need is there or do we want them to use this tool for specific use cases and then require them to specify exactly what they want to have?

I'd rather say keep it optional, as the download might take quite a while and then people can choose if they want to do that on a per-user basis. I understand having the problem to forget it, but also downloading things that you don't want everytime is quite similarly annoying ;-)

@ewels
Copy link
Member Author

ewels commented Jun 7, 2018

Yup, I didn't really have a strong opinion on this and agree with all of the points made. I'll switch the current logic with a single -s/--singularity flag which enables the download (disabled by default). Thanks for the discussion all! 👍

@senthil10
Copy link
Contributor

Would it be good/helpful to be able to just download the container images for given pipeline ?
(Like if someone forgot to give -s the first time)

@sven1103
Copy link
Member

sven1103 commented Jun 7, 2018

@senthil10 I think this is a useful feature! 👍

probably with a flag --get-container (or similar)?

@ewels
Copy link
Member Author

ewels commented Jun 7, 2018

I'm a little confused by this last request - in discussion I think it was about rerunning the command if we forget to specify -s, but without having to download the pipeline files a second time. I would prefer not to do this, as using local files makes the versioning less failsafe and downloading them a second time only takes a second.

We have to download the pipeline files to get the container, as the pipeline config is parsed to get the container name dynamically. So we could do a special mode which does this and then deletes those files (or just saves them to /tmp), but I feel that this is kind of overkill in this scenario to be honest..

@sven1103
Copy link
Member

sven1103 commented Jun 7, 2018

@ewels Ah ok, I see. Ok, then let's not stress this pull request to much, there are still tests to write. @senthil10 We can keep this in our mind, if there is a future request for such a feature coming up, we can still brainstorm about and implement it.

@ewels: I am happy to write the tests if you like :)

@senthil10
Copy link
Contributor

@sven1103 yes, I agree too 👍 @ewels have already done a great job with this PR :)

As you mentioned we can add this in the future depending upon need (maybe something to debate during nf-core hackathon 😉 ).

In general I think it can be a good feature to have, though its not a crucial feature it would still give little more flexibility for advance users. For instance if a user downloads just the pipeline first, then tweaks the pipeline a bit for fun or internal necessity and later if they ran into a situation where they needed the containers this feature might come in handy. It can be either done by nf-core or by the pipeline itself (a script to download all related containers). @maxulysse have already implemented this in sarek with a buildContainers script.

@ewels
Copy link
Member Author

ewels commented Jun 7, 2018

@sven1103 - if you could write the tests that would of course be amazing 🎉 😉

@senthil10 - I still don't really think that this would be particularly useful, since our pipelines have only one container (Sarek has loads, which is why a script makes sense). So it really is just singularity pull <name> which for advanced users is not hard.

The point of this download script is to specifically to:

  1. get both the pipeline script and containers in one command
  2. do it safely for a tagged, stable release only, for reproducibility.

@ewels
Copy link
Member Author

ewels commented Jun 7, 2018

...and if someone could leave a positive review and merge, that'd be fab 😉

@senthil10
Copy link
Contributor

@ewels I did say "it is not crucial, it would just give more flexibility" and I said that thinking of packages with multiple containers, I dint know all the packages in nf-core have only one container 😅

Copy link
Contributor

@senthil10 senthil10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As usual, extremely good coding @ewels 👏 👏

image

P.S: I will leave the merging to @sven1103 or others 😇

@sven1103
Copy link
Member

sven1103 commented Jun 8, 2018

@ewels I think you did a great implementation with this feature 🎉 ! I really like it! I will merge and write the tests over the weekend.

@sven1103 sven1103 merged commit f2862ae into nf-core:master Jun 8, 2018
@maxulysse
Copy link
Member

@senthil10 @ewels The buildContainers script is IMHO only useful when having to build several containers.
With it I can build and push several Docker containers at once thanks to Nextflow.
Same thing with pulling Singularity containers.

It was also a good opportunity to fun trying some Nextflow stuff.

@ewels ewels deleted the nf-core-download branch June 8, 2018 11:48
@ewels
Copy link
Member Author

ewels commented Jun 8, 2018

Thanks for the discussion everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants