[MRG] Install R packages before copying repo contents #718

Merged
merged 3 commits into jupyter:master from betatim:early-r on Jul 16, 2019

Conversation

@betatim (Member) commented Jun 25, 2019

This is the R part of #716. Needs rebasing after #716 gets merged.

@betatim (Member, Author) commented Jun 25, 2019

A good example and alternative to moving install.R early is linked in #487 (comment). It is a buildpack that uses a different base image.

One thing I don't understand is how installing a specific R package (selected by MRAN date) would work out faster than with the current approach if we start from (say) rocker/binder:3.6.0 as base image. Chances are tidyverse from 2019-04-11 isn't the one already installed so we have to install it again.

The other thing that isn't clear to me: if I run install.packages('random-package') in the rocker/binder:3.6.0 image before and after random-package makes a release, how is the version that gets installed selected?


I see four options:

  1. move install.R forwards (proposed here)

  2. adopt the buildpack-with-custom-base from Yuvi's example and make a RockerBuildPack, then recommend people use that over the R buildpack

  3. make installing popular R packages from a specific MRAN date part of the "basics" you get when you select the R buildpack in repo2docker. Right now we already select the Shiny version for users.

  4. move install.R forwards but make it OK if it fails (add a || true on the end), then re-run Rscript install.R after copying the repo (sketched below). It seems install.packages('tidyverse') only takes a few seconds if you've previously installed tidyverse. This would give us the best of both worlds: fast rebuilds, and recovery if users do actually refer to the repo being built in their install.R.
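
A minimal sketch of what option 4 could look like, written in the style of a repo2docker buildpack that returns (user, shell snippet) pairs. The hook name get_preassemble_scripts() and the exact shell lines are illustrative assumptions, not necessarily the API or commands this PR ends up using:

```python
# Illustrative sketch of option 4 (not the actual repo2docker code): try
# install.R before the repository is copied in, tolerate failure, and run
# it again afterwards. Hook names mimic get_assemble_scripts()-style
# buildpack methods but are assumptions here.


class EarlyInstallRSketch:
    def get_preassemble_scripts(self):
        # Early attempt: "|| true" keeps the build going if install.R needs
        # repository contents that have not been copied yet.
        return [
            ("${NB_USER}", "test ! -f install.R || Rscript install.R || true"),
        ]

    def get_assemble_scripts(self):
        # Second attempt after the repository contents are present; if the
        # early run already installed everything this should be fast.
        return [
            ("${NB_USER}", "test ! -f install.R || Rscript install.R"),
        ]


if __name__ == "__main__":
    sketch = EarlyInstallRSketch()
    for user, script in sketch.get_preassemble_scripts() + sketch.get_assemble_scripts():
        print(f"USER {user}")
        print(f"RUN {script}")
```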

@yuvipanda (Collaborator) commented Jun 25, 2019

Of the 4, I like 4 as an experiment :)

I think for (2), my intent wasn't to make it be part of repo2docker, but to upstream it to https://github.com/rocker-org/binder.

@betatim betatim force-pushed the betatim:early-r branch from bda4ebd to 3858ba5 Jun 25, 2019
@betatim (Member, Author) commented Jun 26, 2019

And because there is never anything new under the sun: there is some prior art in #410 that we should look at and ponder.

@betatim betatim force-pushed the betatim:early-r branch from 3858ba5 to 7f4af32 Jun 29, 2019
@betatim (Member, Author) commented Jun 29, 2019

Option 4 is now implemented. I decided to use an empty file to store the state of "pre-assembled worked", as sometimes re-running install.packages() took just as long as the first time :-/

Test it with something like https://github.com/betatim/r-caching (which tries to read the README.md in install.R).
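
A rough sketch of the sentinel-file refinement described above: pre-assembly records its success in an empty marker file, and the post-copy run is skipped when that marker exists. The marker path and shell fragments are assumptions for illustration, not necessarily what this PR ships:

```python
# Sketch of the "pre-assembled worked" marker (path is an assumption): only
# touch the marker if install.R ran to completion, and skip the re-run when
# the marker is present.
MARKER = "/tmp/.preassembled"

# pre-assemble step: tolerate failure, but record success
preassemble = f"test ! -f install.R || (Rscript install.R && touch {MARKER}) || true"

# assemble step (after the repo is copied): re-run only if pre-assembly failed
assemble = f"test -f {MARKER} || test ! -f install.R || Rscript install.R"

print(preassemble)
print(assemble)
```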

Commit message: If the pre-assembly step succeeds we do not re-run it after copying the repository contents.
@betatim betatim force-pushed the betatim:early-r branch from 7f4af32 to 66227f8 Jul 6, 2019
@yuvipanda (Collaborator) commented Jul 7, 2019

Thanks for implementing that, @betatim. This is definitely a lot cleaner in terms of useful error messages and whatnot.

I'm very conflicted on this particular PR, on two issues:

  1. install.R performance is absolutely, almost unacceptably terrible right now, and something like this will definitely make it a lot better
  2. This might still have weird side effects. I don't actually know what R does if a package install fails halfway through (because it depended on the local filesystem?) and then you re-do it. These kinds of consistency issues aren't something we have to deal with right now, since we just evict the cache instead.

So it's a tradeoff between performance and a cleaner conceptual mental model. Based on the work you had done earlier looking at repos that have an install.R, and the conversations in #716, I feel this is an OK tradeoff if we make sure to document it heavily, including the fact that this is an explicit tradeoff we made for this case.

We should also add a section to https://repo2docker.readthedocs.io/en/latest/design.html about performance tradeoffs and how to evaluate them. (This doesn't need to block this PR though.)

Again, thank you for diligently pushing this through :D <3

@yuvipanda (Collaborator) commented Jul 7, 2019

I restarted the Travis job, which I think was a flake.

@yuvipanda (Collaborator) commented Jul 7, 2019

I'm happy to merge this after:

  • Add a doc blurb to the install.R docs
  • Ping @cboettig and @karthik to see what they think
  • Open an issue to figure out how we can test 'cache re-use' when building images (#733)
@betatim (Member, Author) commented Jul 7, 2019

The "external" job in travis will keep failing till #732 or something like it is merged.

It'll be good to hear where @cboettig and @karthik think this approach could fail, how likely it is that someone will hit that problem, and how serious the outcome would be.

@minrk (Member) commented Jul 9, 2019

This is looking so familiar that I thought we already had this, and now I realize that I never finished #410. Maybe it's worth pulling that in here, since it's the conda and Python implementation of this same thing?

update: Re-read comments and saw that #410 was already mentioned. I'll look at redoing #410 with the same APIs as here.

Review thread on repo2docker/buildpacks/base.py (outdated, resolved)
@betatim (Member, Author) commented Jul 9, 2019

#410 is indeed the parent of all these ideas! I didn't pull in the pip and conda changes here because I thought we should focus on the R part, in particular the fact that we can't tell whether install.R will succeed before executing it. That seems like the trickiest part here.

Once we merge this we can reboot #410, and it should be a "no-brainer" to merge (at least I can't think why you'd merge this but oppose #410 :) ).

@betatim betatim changed the title [WIP] Install R packages before copying repo contents [MRG] Install R packages before copying repo contents Jul 9, 2019
@minrk (Member) commented Jul 11, 2019

My only concern with running the install twice is that I'm not 100% sure that a failed execution followed by a successful one gives the same result as a single successful execution. E.g. if install.R unconditionally does something that will fail on the second run, but that bit comes before the step where the first run fails.

I could write a totally fake install.R that does this, but I'm not sure it's a realistic concern. It's just that folks whose install really does require the whole repo would have no real way to skip the first, failing install. I hate adding more files, but is there a way to opt into or out of the first try? E.g. a magic comment at the top of install.R?
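
A hypothetical sketch of the opt-out idea, just to make it concrete: check the first non-empty line of install.R for a magic comment and skip the early (pre-repo) run when it is present. Neither the comment syntax nor this helper exists in repo2docker; both are made up for illustration:

```python
# Hypothetical opt-out check (nothing like this exists in repo2docker):
# a magic comment on the first non-empty line of install.R disables the
# early, pre-repo install attempt.


def wants_early_install(path="install.R",
                        magic_comment="# repo2docker: no-preassemble"):
    """Return False if install.R opts out of the early install attempt."""
    try:
        with open(path) as f:
            for line in f:
                stripped = line.strip()
                if not stripped:
                    continue
                # Only the first non-empty line is considered "the top".
                return stripped != magic_comment
    except FileNotFoundError:
        pass
    return True
```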

@betatim (Member, Author) commented Jul 12, 2019

From a totally unscientific study of maybe 30 or 40 install.R files I could find via GitHub search, I think the chances are low that we will run into the "install.R isn't idempotent" problem or the weaker "running install.R without the repo and then with the repo context will not do the right thing" problem (the latter is the property we actually rely on here).

So I would chance it, and when someone reports a problem (or we spot one on https://grafana.mybinder.org/d/fZWsQmnmz/pod-activity?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-12h&to=now&var-cluster=prometheus) we can introduce the "magic comment to opt out".

It felt like most install.Rs (that weren't clearly "not made for Binder") call install.packages() and the "install from GitHub" version of it. Some define a list of package names and then iterate over it. I didn't see one that read the package list in from a file (in Python, setup.py often reads requirements.txt). I don't know if there is an R equivalent of pip install -e .; if there is, I didn't see it used.

TL;DR: I'd wait to see if we get build failures on mybinder.org where people would need to opt out, then implement a magic comment.

@betatim betatim force-pushed the betatim:early-r branch from cc63262 to 09b8481 Jul 12, 2019
@betatim (Member, Author) commented Jul 12, 2019

PS: @minrk do you want to reboot #410 or should/could someone else make a new PR?

@minrk (Member) commented Jul 16, 2019

Makes sense. I'll give resurrecting #410 a go. Thanks!

@minrk minrk merged commit 5ef42fc into jupyter:master Jul 16, 2019
3 checks passed:
  • ci/circleci: build_docs (Your tests passed on CircleCI!)
  • ci/dockercloud (Your tests passed in Docker Cloud)
  • continuous-integration/travis-ci/pr (The Travis CI build passed)
@betatim betatim deleted the betatim:early-r branch Jul 16, 2019
@minrk (Member) commented Jul 16, 2019

#743 is the re-issue of #410

@meeseeksmachine commented Jul 16, 2019

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/tip-embed-custom-github-content-in-a-binder-link-with-nbgitpuller/922/50

@betatim betatim referenced this pull request Jul 24, 2019