Different versions of R or staying just a point release behind the current version #661

Closed
ajstewartlang opened this issue May 2, 2019 · 21 comments · Fixed by #772

Comments

@ajstewartlang

Proposed change

Allowing different versions of R (and possibly RStudio) to be run (R 3.6 vs. 3.5 vs. 3.4 etc).

Alternative options

Assuming that installing the most recent version of R would be less intensive than installing a pre-specified version, just keeping up to date with the major release versions would be a fantastic alternative - but maybe a point or two behind. R 3.6 has just been released - so maybe the most stable release is the previous version (3.5.3).

Who would use this feature?

Reproducibility has become increasingly important for researchers in Psychology and many groups and labs have switched to R for open and reproducible research (incl. in teaching). Given that reproducing the computational environment is arguably the gold standard of reproducibility, it would be hugely beneficial to be able to launch a particular version of R using repo2docker - so the runtime.txt file wouldn't just determine what version of packages are pulled from MRAN, but it could also contain a specification of which version of R to run.

I can imagine a future where research journal articles each contain a link in them so that reviewers and readers can launch a Binder to see the entire analysis script and data exactly as it was on the date the authors carried out the analysis. That would be a huge benefit to the community and a massive boon for reproducibility.

How much effort will adding it take?

I'm not sure I can estimate this.

Who can do this work?

I'm guessing someone who knows how to install different versions of R via Ubuntu’s package manager - and how repo2docker can read info in the runtime.txt file to determine which version of R to then install. I'm happy to help where I can - although I'm only a psychologist!

@betatim
Member

betatim commented May 2, 2019

First of all: huge thank you for taking the time to fill in our (brand new!) issue template for this. I am already 200% more excited about getting this done :)

Second: I think we should do this. Letting users choose which version of R they need is essential and there is no good reason why you can do it for Python but not R (modulo the fact that we need to find time to do it). A reason R support is a bit behind is that we lack R users to guide how repo2docker should do things. One of our guiding principles is that we want to follow what the respective communities are already doing, so we can pick up users where they are instead of prescribing how they should be doing it.

One idea that has been floated is to use the conda package manager to install different versions of the R binaries. You mentioned using whatever mechanism is in Ubuntu to install different versions. Is there another way? Otherwise we can look at the pros and cons of these two.

I like the idea of extending the format for runtime.txt to allow users to specify the version of R.

I think changing the version of R will be as much technical work as making it configurable (famous last words). Mostly because I think with the current setup we have, we can't actually change the version because we use whatever is "the version used with Ubuntu 18.04".

Things to do:

  • collect methods to install R on Ubuntu
  • pros&cons of each method
  • decide a format for runtime.txt to specify R versions and MRAN dates at the same time
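One possible shape for the last item, as a minimal sketch: keep the existing `r-YYYY-MM-DD` date form and allow an optional R version in front of the date. The `r-<version>-<date>` layout and the `parse_runtime` helper are assumptions made up for illustration, not a format anyone has agreed on.

```python
import datetime
import re

# Hypothetical format: "r-<YYYY>-<MM>-<DD>" (date only) or
# "r-<version>-<YYYY>-<MM>-<DD>" (R version plus MRAN snapshot date).
RUNTIME_RE = re.compile(
    r"^r-(?:(?P<version>\d+\.\d+(?:\.\d+)?)-)?"
    r"(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})$"
)


def parse_runtime(text):
    """Return (r_version or None, snapshot_date) for a runtime.txt string."""
    match = RUNTIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised runtime.txt content: {text!r}")
    snapshot = datetime.date(
        int(match["year"]), int(match["month"]), int(match["day"])
    )
    return match["version"], snapshot


print(parse_runtime("r-2019-05-02"))
print(parse_runtime("r-3.5.3-2019-05-02"))
```

Because the version part is optional, files that only give a date would keep parsing as before, which is the backwards-compatibility property we would want from any such format.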

@trallard
Contributor

trallard commented May 3, 2019

I will take this on! I have spent way too much time creating R containers in the past so this should help.

Although I like the idea of using conda to install R this tends to inflate the size of the images quite a lot.
And I would prefer not having this as the main method for those purely R based containers.

I have a quick question though: is there a reason for us using the Ubuntu base image as opposed to a Debian one to build the R images? (note that this is purely out of curiosity).

@betatim
Member

betatim commented May 3, 2019

Although I like the idea of using conda to install R this tends to inflate the size of the images quite a lot. And I would prefer not having this as the main method for those purely R based containers.

miniconda will already be installed because we install a Jupyter notebook server as the default frontend (R via notebooks) and to host our proxy that then sends you on to RStudio.

We should collect a few ways of installing the R binaries (not the R packages) in a Ubuntu base image that delivers all the things that a minimal repo2docker image delivers. Then we can look at how big the various images are and what we can tweak and what the trade off between size and extra engineering effort is etc.

is there a reason for us using the Ubuntu base image as opposed to a Debian one to build the R images?

All build packs in the core repo2docker setup share a base image to allow you to compose build packs. This is what allows us to produce images that have R and Python and X and Y installed. I don't think we should change either the ability to compose the build packs in repo2docker nor the base image we use for all of them. For enabling different build packs with different base images that (potentially) don't compose see #487 (comment).
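As a toy illustration of what composing means here: every build pack contributes steps on top of one shared base image, so any combination can be stacked. This is not repo2docker's actual class layout; `BuildPack`, `compose` and the install steps are invented for the example, and the base image tag is an assumption based on the Ubuntu 18.04 mentioned above.

```python
from dataclasses import dataclass, field

# Assumed shared base image; only "Ubuntu 18.04" is given in the thread.
BASE_IMAGE = "ubuntu:18.04"


@dataclass
class BuildPack:
    """A named set of Dockerfile steps that assumes the shared base image."""
    name: str
    steps: list = field(default_factory=list)


def compose(*packs):
    """Stack the steps of several build packs on the single base image."""
    lines = [f"FROM {BASE_IMAGE}"]
    for pack in packs:
        lines.append(f"# {pack.name}")
        lines.extend(pack.steps)
    return "\n".join(lines)


python_pack = BuildPack("python", ["RUN apt-get update && apt-get install -y python3"])
r_pack = BuildPack("r", ["RUN apt-get update && apt-get install -y r-base"])
print(compose(python_pack, r_pack))
```

If the R pack needed its own `FROM` line, the stacking above would no longer work, which is the concern with switching only the R images to a different base.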

I don't think we have a written down origin story why Ubuntu is the base image. To me Ubuntu is the more widely used on desktop version of Debian with packages that are more up to date.

@trallard
Contributor

trallard commented May 4, 2019

We should collect a few ways of installing the R binaries (not the R packages) in a Ubuntu base image that delivers all the things that a minimal repo2docker image delivers. Then we can look at how big the various images are and what we can tweak and what the tradeoff between size and extra engineering effort is etc.

Cool, I will start with this and report on the findings. We might also want to evaluate whether multistage builds make sense and help us build lighter-weight images.

All build packs in the core repo2docker setup share a base image to allow you to compose build packs. This is what allows us to produce images that have R and Python and X and Y installed. I don't think we should change either the ability to compose the build packs in repo2docker nor the base image we use for all of them. For enabling different build packs with different base images that (potentially) don't compose see #487 (comment).

Agree, keeping a consistent base image should help with long-term maintainability

@ajstewartlang
Author

Interesting to see a breaking change (in terms of reproducibility) in R 3.6 - specifically to do with the random number algorithm and the sample() function:

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494

So, R scripts involving set.seed() and sample() will produce different results in R 3.6 vs. older versions of R. I think this is another bit of evidence as to why being able to specify which version of R to launch in Binder is important for full reproducibility. Twitter chat suggests this modification of the random number generation algorithm could cause confusion amongst those not aware of the underlying change when trying to run older scripts using R 3.6.
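A rough illustration of why the change breaks seeded scripts, assuming (as I understand the R 3.6 release notes) that the old "Rounding" method scaled a uniform draw by n and truncated, while the new "Rejection" default draws random bits and retries when the value is out of range. This is Python rather than R, so the numbers are not what either R version would print, and the helper names are made up for the example.

```python
import random


def sample_rounding(rng, n):
    """Old-style index: scale one uniform draw by n and truncate."""
    return int(n * rng.random())


def sample_rejection(rng, n):
    """New-style index: draw just enough random bits, retry if out of range."""
    bits = max(1, (n - 1).bit_length())
    while True:
        candidate = rng.getrandbits(bits)
        if candidate < n:
            return candidate


def draw(method, seed, n, size):
    """Draw `size` indices in [0, n) from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [method(rng, n) for _ in range(size)]


old = draw(sample_rounding, seed=42, n=10, size=5)
new = draw(sample_rejection, seed=42, n=10, size=5)
print(old, new)  # same seed, but the two methods need not agree
```

Each method is reproducible on its own, but they consume the random stream differently, so the same seed does not guarantee the same sample across the two. In R itself, `RNGkind(sample.kind = "Rounding")` requests the old behaviour when reproducing older results.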

@cannin

cannin commented May 20, 2019

Is there any suggested way of getting a particular r-base version, without a full Dockerfile, available directly from the R project site?

https://cloud.r-project.org/bin/linux/ubuntu/

@trallard
Contributor

Not at the moment, this is what we are currently working on. For now the workaround would be to use a custom Docker (or any of the rocker project) t

@ajstewartlang
Author

Thanks - I've been playing around setting up a Dockerfile in my repo using Rocker to run different versions of R. I've noticed that via this route, packages (such as ggplot2) load really quickly (and it doesn't look as if they're being built by Binder on the fly). Is the Dockerfile (containing the Rocker info) pulling a prebuilt version of packages into the Binder alongside base R, or is the package build just happening super fast? I'm guessing it's the former, as the occasional Rocker R version (e.g., 3.5.3) doesn't seem to install the version of ggplot2 specified via runtime.txt, which makes me think it's because there isn't a Rocker image of 3.5.3 with that version of ggplot2 pre-built (so the latest is installed by default). You can see the repo I'm using here:

https://github.com/ajstewartlang/Binder_demo

Either way, this seems like a nice workaround to using different versions of R and also getting certain packages up and running in the Binder quickly.

@trallard
Contributor

trallard commented Jun 7, 2019

Hi @ajstewartlang, yes: the rocker image rocker/binder uses rocker/geospatial as its base (see here for reference), meaning that it has all the dependencies needed by a wide number of packages, the whole lot of the tidyverse, and the base for Shiny. In brief, this means that:

  1. runtime.txt is completely redundant
  2. by default the latest versions of the packages are installed (as inherited from the rocker/base images and determined as the 'latest' at build time of the official rocker image).

Now, to have more granular control of your dependencies and versions, I recommend installing pacman via the install.R (`install.packages('pacman')`) and then using it to install the libraries you need to pin to specific versions.

@pat-s

pat-s commented Aug 20, 2019

@ajstewartlang @trallard Shouldn't your problems be solved by using karthik/holepunch? For me it only did not suffice because I needed additional system dependencies.
You can set any R version in the dockerfile.

Apparently we're talking about different problems here and in holepunch:

  • repo2docker: Use eventually the rocker images for R based workflows instead of using a bionic image (current default?) (which is limited to one R version) and install many R packages from scratch
  • holepunch: Figure out how to install additional system dependencies. For some reason, simply adding something like
RUN export DEBIAN_FRONTEND=noninteractive; apt-get -y update \
 && apt-get install -y gdal-bin \
	libgeos-dev \
	libudunits2-dev \
	make \
	wget

in the docker file as in here did not work for me OOB.
If we find out how we can support the installation of additional apt packages using rocker images as the base in a docker file, holepunch will do the rest (I think).

@karthik

karthik commented Aug 20, 2019

@pat-s I'm working to get holepunch stable for most use cases before allowing for adding additional system dependencies. If you file an issue there I can keep it on my list for the next set of fixes.

@betatim
Member

betatim commented Aug 21, 2019

For repo2docker, whatever solution we want to try has to result in a buildpack that lets users choose the version of R and is composable with other buildpacks, so users can install (say) Python and R stuff simultaneously.

This means it isn't clear if we could use a different base image for the R buildpack, it would depend on it being "essentially the same" base image as the ubuntu one we use elsewhere.

Another option is to extend the commands here and afterwards so that they know how to install different versions of R.

@nuest
Contributor

nuest commented Aug 21, 2019

I agree with @betatim that "use eventually the Rocker images" suggested by @pat-s will be hard to make work with r2d. Rocker images are Debian-based (so not too far away from Ubuntu), but I'm not sure what happens if multiple build packs are triggered on a different base image.

[IMHO copying how Rocker does it in the R buildpack without replicating the whole variety of Rocker images is a reasonable approach for r2d.]

A related issue about system dependencies: #762.

@pat-s

pat-s commented Aug 22, 2019

@karthik The issue already exists: karthik/holepunch#20

@betatim @nuest I see. If there is the need to stay with the current image, I only see two options:

@betatim
Member

betatim commented Aug 22, 2019

Another idea I just tested is to install the R binary from conda-forge. They seem to have all(?) versions available and we already use conda to install other packages. I tested this by taking https://github.com/binder-examples/r and adding an environment.yml that contained r-base. This gets you R 3.5 but there is also 3.6 available. It doesn't quite work: something during our installing of packages is looking in the wrong place or some such. I like this approach because it lets users choose the version of R they want independent of the MRAN date, and we won't have to worry about how to extend runtime.txt in a backwards compatible way for users who want to specify their R version. More work is required though to avoid having two versions of R installed and investigate what dependencies the conda R binaries pull in that the Ubuntu package doesn't (the conda ones bring in xorg packages which seems a bit excessive...)
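A sketch of what such an environment.yml could look like, written as a small Python helper so the pinned and unpinned variants sit side by side. `render_environment` is a hypothetical name for illustration and not something repo2docker provides; the pin uses conda's usual `name=version` form.

```python
def render_environment(r_version=None):
    """Render a minimal environment.yml that installs R from conda-forge.

    With no version the r-base package is left unpinned; otherwise it is
    pinned with conda's "name=version" match syntax.
    """
    package = "r-base" if r_version is None else f"r-base={r_version}"
    lines = [
        "channels:",
        "  - conda-forge",
        "dependencies:",
        f"  - {package}",
    ]
    return "\n".join(lines) + "\n"


print(render_environment("3.6"))
```

The point of the approach is that the R version lives in this file while the MRAN date stays in runtime.txt, so neither format has to change.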

I also couldn't find a PPA that has different R versions for Ubuntu. The R website itself only seems to have 3.6 now :-/

Do you have any experience with how long it takes to compile R from source?

@trallard
Contributor

I have done the whole R installation from source multiple times (mainly on alpine) and it can take quite a bit.
A way to get around this might be to have a base repo2docker image (that compiles R from source) and then use that to build the rest of the images.
I can have a look at this after EuroScipy as I have some days blocked for OSS work

@pat-s

pat-s commented Aug 22, 2019

Do you have any experience with how long it takes to compile R from source?

This takes up to 5 mins depending on the CPU speed. But everything that's > 1 min would probably not be acceptable. That's why the travis folks use some precompiled binary.

Idk anything about conda-forge but if that works for one version fixing others shouldn't be a big deal.

@jdblischak

Chiming in with some more details on conda-forge. I'm happy to help test anything out as you are considering the various options.

Another idea I just tested is to install the R binary from conda-forge. They seem to have all(?) versions available and we already use conda to install other packages.

For conda-forge, we build the first patch of each minor release, e.g. 3.4.1, 3.5.1, 3.6.1, etc. This is because all the R conda binaries have to be re-built against each R release.

https://anaconda.org/conda-forge/r-base/files
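To make the consequence concrete, here is a small sketch of how a requested R version could be mapped onto what is actually built. The `AVAILABLE` list just repeats the examples above rather than the real file listing, and `resolve_r_version` is a hypothetical helper, not part of conda or repo2docker.

```python
# Example builds only (first patch of each minor release); not the real listing.
AVAILABLE = ["3.4.1", "3.5.1", "3.6.1"]


def resolve_r_version(requested, available=AVAILABLE):
    """Return the available build sharing the requested major.minor series."""
    series = tuple(requested.split(".")[:2])
    if len(series) != 2:
        raise ValueError(f"expected at least major.minor, got {requested!r}")
    for candidate in available:
        if tuple(candidate.split(".")[:2]) == series:
            return candidate
    raise LookupError(f"no build available for R {'.'.join(series)}")


print(resolve_r_version("3.5.3"))  # falls back to the 3.5 series build
```

So a user asking for 3.5.3 would get the 3.5.1 build, and a request for a minor release with no build at all would fail outright.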

More work is required though to avoid having two versions of R installed and investigate what dependencies the conda R binaries pull in that the Ubuntu package doesn't (the conda ones bring in xorg packages which seems a bit excessive...)

We added the xorg packages because R users were getting errors while trying to run analyses from minimal Docker containers, e.g. bgruening/docker-galaxy-stable#420

@betatim
Member

betatim commented Aug 23, 2019

For conda-forge, we build the first patch of each minor release, e.g. 3.4.1, 3.5.1, 3.6.1, etc

Is this good enough? Could it be stepped up if there was demand? I am not a R user so no idea what the implications are of only(?) having the first patch release to choose from.

xorg topic

Do you know if the r-base ubuntu package also pulls in xorg? If yes that means we aren't adding anything new. If no we need to see how much bigger the resulting image gets.

@jdblischak

Is this good enough?

I think it is good enough for most use cases. The differences between patch releases are minimal. The main issue occurs in the 3 months between the minor release and its first patch release. This problem is compounded if Bioconductor requires the new minor release.

Could it be stepped up if there was demand?

That's hard for me to say. The team of people that maintains the conda-forge R packages is small. However, since the patch releases are so similar, the main burden would be increased CI time. conda-forge recently switched to Azure, which has helped reduce CI wait times, but I don't know if it is sufficient.

Another recent development that could help here is that conda-forge has started pinning R to only the minor release. See here for discussion. Thus going forward it is possible that we could provide all the patch level releases of r-base but still only build one binary of each R package for that minor release. As far as I know this potential has only been discussed, but no decision has yet been made.

Do you know if the r-base ubuntu package also pulls in xorg?

The Ubuntu r-base package does not pull in any xorg packages as far as I can tell. I ran the following to confirm:

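$ docker run -it --rm ubuntu:xenial
# The remaining commands run inside the container.
# Quote the glob so the shell does not expand it against local files.
apt list --installed '*xorg*'
apt update
apt install -y r-base
apt list --installed '*xorg*'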

@meeseeksmachine

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/binder-from-github-dockerfile-not-starting-rstudio/7986/21
