Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Install APT packages before copying the repo contents #716

Merged
merged 2 commits into from Jun 25, 2019

Conversation

@betatim
Copy link
Member

betatim commented Jun 25, 2019

Split the assemble scripts into two categories. The default one which is as it was: executed after we copy the contents of the repository. A new one that build packs can opt into: executed just before we copy the content of the repository.

I also moved when we install stencila and some other R related setup things to the build scripts as they don't depend on the contents of the repo. This way they can be shared amongst all images.

This (the pre-assemble phase) could be useful for things like apt.txt which depends on the content of the repo (so shouldn't go in get_build_scripts()) but to execute the resulting command you don't need access to the contents of the repository. A counter example would be requirements.txt as the resulting pip install -r requirements.txt could need access to the repository.

Related to #707

@betatim betatim changed the title [WIP] Install APT packages before copying the repo contents [WIP] Install APT and R packages before copying the repo contents Jun 25, 2019
@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 25, 2019

This is pretty sweet. I built repo2docker https://github.com/binder-examples/r once which takes a long time. Then I cloned that same repo to a local directory and did repo2docker /tmp/local_clone_of_example_r and it builds instantly because all layers are in the docker cache! Now, go edit the README.md in your local copy and rerun repo2docker /tmp/local_clone_of_example_r and witness it still being super fast!

This only works if we assume that install.R does not refer to anything in the repository itself when it gets executed. I think that is a safe assumption to make and we can write it down in our docs as an assumption. I guess that is an advantage of inventing a new format, we control it :)

Need to check if we can do something like this for environment.yml or not. requirements.txt is for sure out, or at least needs a smarter way of trying to perform this trick and failing gracefully.

@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 25, 2019

WDYT @yuvipanda? before spending more time polishing this I am now pondering what the edge cases could be and when things would break for users.

Then sort out the string quoting business and add a comment to the docs saying that "install.R has no access to the repository contents."

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

I think this makes total sense for apt.txt.

However, I don't think we should do this for install.R. it is arbitrary R code, and people will put arbitrary things there. We should treat it like postBuild.

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

And thank you for working on this!

@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 25, 2019

I think because we invented install.R we can also define that it can't contain arbitrary R code that needs access to the repo's content. It addresses a huge pain point users have which makes me want to be more pragmatic than usual :-/

As an alternative we could start versioning it and say something like "if the first line is # v2 then you will get faster rebuilds at the price of no access to the repo's contents". Without a magic comment you'd get the old behaviour and slow rebuilds.

@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 25, 2019

Did a bit of looking at what people do put into install.R by running this query:

https://github.com/search?l=&q=filename%3Ainstall.R+pushed%3A%3E2018-01-01+language%3AR&type=Code

I didn't do an exact count but by page 7 of the results my impression is that for a large fraction of the install.R files are not in a place repo2docker would find them or they just contain install.library() commands. There are a few that do more complex stuff but those tend to be in repos that have a Dockerfile and don't look like they were made for repo2docker. There is just one repo that I found (Ben Marwick made some commits) that explicitly referenced install.packages('.') but this repo also had a DESCRIPTION file in it which we handle explicitly.

Starting to make me think that this is a change that is worth it because it speeds up things and breaks few repos.

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

/cc @cboettig and @karthik who also probably have opinions.

I think because we invented install.R we can also define that it can't contain arbitrary R code that needs access to the repo's content. It addresses a huge pain point users have which makes me want to be more pragmatic than usual :-/

IMO, we invented the name only. The contents should work the same if they were executed on a user's computer without repo2docker - especially with the R extension. Otherwise it's going to lead to lots of confusion. I also do not know if installing R packages runs arbitrary code rather than anything declarative.

IMO, we should split this into two different PRs. The apt.txt one is a no-brainer, while I think install.R requires a lot more careful consideration, and input from the R community.

As an alternative we could start versioning it and say something like "if the first line is # v2 then you will get faster rebuilds at the price of no access to the repo's contents". Without a magic comment you'd get the old behaviour and slow rebuilds.

IMO this is the proposal we should consider. We have to be careful, since we're again inventing something extremely repo2docker / mybinder.org specific.

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

My personal preference for speeding up R builds is to do something like https://github.com/pangeo-data/pangeo-stacks#customize-images-with-the--onbuild-variants but based on rocker images. This helps the R community self-serve. I've sortof started working on this in https://github.com/berkeley-dsep-infra/datahub/tree/staging/deployments/r/image, and will move it to using something like https://github.com/pangeo-data/pangeo-stacks/blob/master/onbuild/r2d_overlay.py soon.

This can totally happen in parallel to work here, though.

@cboettig

This comment has been minimized.

Copy link

cboettig commented Jun 25, 2019

This does seem reasonable, but like @yuvipanda I'm super wary of inventing custom config systems; particularly when the assumptions they operate with are likely to be a bit opaque to the average user.

So I think the most likely failure mode for this will be if the R user is also using something like packrat or renv (as we've discussed elsewhere here), since both of those packages operate on other files in the repo. Given the shared aims and the popularity of those packages I suspect that would come up here sooner or later too.

@kevinushey has written up some recent material about caching renv installs in Docker that might be useful, e.g. rstudio/renv#78, though it's not obvious to me if that would be applicable in the context of Binder.

@betatim betatim force-pushed the betatim:early-apt-get branch from bda4ebd to 41da674 Jun 25, 2019
@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 25, 2019

I've removed the moving of install.R in the interest of a quick merge.

The contents should work the same if they were executed on a user's computer without repo2docker - especially with the R extension. Otherwise it's going to lead to lots of confusion. I also do not know if installing R packages runs arbitrary code rather than anything declarative.

Not sure I understand this. You can run the install.R on your computer and it will just work. However not all install.R that work on your computer will work with repo2docker. I think that is fine. I am pretty sure install.package('random-package') can execute arbitrary code, however random-package is not going to rely on the contents of the repo we are building. If the outcome of a package install is going to depend on the working directory from which it was executed we might as well just give up making stuff reproducible :).

We do have a problem if there is a install.package('.') in the install.R which is why I'd write in our docs that this isn't supported.

In the past (the whole point of the project even ;) ) we have mostly decided to serve the masses instead of making their life harder to please niche cases. In the trade-off between faster rebuilds for the masses and serving a niche audience, I think we should go for fast rebuilds.

renv

I don't see how this is related. Does renv use a install.R file? I thought it had its own set of files for storing what to install.

r2d overlay and rocker base images

How will this solve the problem that if we run the install.package commands after COPYing the repo contents we will have to re-execute them every time someone fixes a typo in their README?

@betatim betatim changed the title [WIP] Install APT and R packages before copying the repo contents [MRG] Install APT packages before copying the repo contents Jun 25, 2019
@cboettig

This comment has been minimized.

Copy link

cboettig commented Jun 25, 2019

What happens if a user puts renv::restore() in install.R?

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

However not all install.R that work on your computer will work with repo2docker. I think that is fine.

IMO, this isn't something we should do. IIRC, this isn't true for any other file we support now. It also would be extremely confusing to a lay user.

@cboettig

This comment has been minimized.

Copy link

cboettig commented Jun 25, 2019

How will this solve the problem that if we run the install.package commands ...

I believe the idea was that fewer packages will need to be installed because many will already be available.

Note that while DESCRIPTION-specified dependencies work like that, the default install.packages() is a bit dumb in this respect. It will reinstall the requested package (even if said version is already installed), but even so it will save time since the dependencies of said packages won't be re-installed (e.g. if a user does install.packages("tidyverse"), tidyverse will reinstall but not it's 80+ dependencies in the dependency tree). Just a note that on that topic, @gaborcsardi & co have some pretty exciting work in https://github.com/r-lib/pak/ which provides a drop-in replacement for base install.packages() that does some things like caching and concurrent builds that can greatly speed up installs. I think it's not production-ready yet(?) but could provide an alternative strategy for faster installs for binder too.

@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

I think I still see the R changes - at least some stencila related R changes? Happy to merge the apt related bits though!

@yuvipanda yuvipanda merged commit f45088b into jupyter:master Jun 25, 2019
5 checks passed
5 checks passed
ci/circleci: build_docs Your tests passed on CircleCI!
Details
ci/dockercloud Your tests passed in Docker Cloud
Details
codecov/patch 90.47% of diff hit (target 20%)
Details
codecov/project Absolute coverage decreased by -0.06% but relative coverage increased by +0.3% compared to 781a136
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@yuvipanda

This comment has been minimized.

Copy link
Collaborator

yuvipanda commented Jun 25, 2019

yay for faster repo2docker! Thanks, @betatim

@betatim

This comment has been minimized.

Copy link
Member Author

betatim commented Jun 26, 2019

And because there is never anything new under the sun: some prior art #410 that we should look at and ponder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.