# R in Jupyter with Binder
An example of how to use R in Jupyter notebooks and then make a Binder environment to run them interactively on the web. This repo was inspired by a Tweet in a discussion about Episode 7 of The Bayes Factor podcast.
Disclaimer: I am a physicist and primarily a Python and C++ programmer and I don't use/know R. This repo is just what I know from being able to read code and understanding how Jupyter works.
## Check it out first
Before learning how to set up R in Jupyter, first go check out how cool it is in Binder! Just click the "launch Binder" badge above.
## Table of contents
- Setup and Installation
- Enable package dependency management with packrat
- Using papermill with Jupyter
- Testing Jupyter notebooks with pytest
- Automating testing alongside development with CI
- Setting up a Binder environment
- Preservation and DOI with Zenodo
- R Markdown in Jupyter with jupytext
- Further Reading and Resources
## Setup and Installation

Before you can begin, make sure that you have the following in your working environment:
- Install Jupyter
  - If you aren't familiar with Python and `pip`, then just follow the instructions for installing with Anaconda
- Install R
- Install the R kernel for Jupyter (IRkernel)
## Enable package dependency management with packrat

The first step in any project should be making sure that the library dependencies are specified and reproducible. This can be done in R with the packrat library.
First install packrat:

```
R -e 'install.packages("packrat")'
```

Then from your project directory initialize packrat for your project:

```
R -e 'packrat::init()'
```
which will determine the R libraries used in your project and then build the list of dependencies for you. This effectively creates an isolated computing environment (known as a "virtual environment" in other programming languages).
`packrat::init()` results in the directory `packrat` being created with the files `packrat.lock` and `packrat.opts` inside of it. It will additionally create or edit a project `.Rprofile` and edit any existing `.gitignore` file. The following files should be kept under version control for your project to be reproducible anywhere: `packrat/packrat.lock`, `packrat/packrat.opts`, and `.Rprofile`.
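After initialization, the relevant parts of the project layout look roughly like this (an illustrative sketch; the exact ignore rules and auxiliary files vary between packrat versions):

```
project/
├── .Rprofile           # sources packrat's startup hook when R starts here
├── .gitignore          # edited so packrat's library directories are ignored
└── packrat/
    ├── packrat.lock    # exact package versions (commit this)
    └── packrat.opts    # packrat options (commit this)
```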
As you work on your project and use more libraries you can update the dependencies managed by `packrat` with (inside the R environment)

```r
packrat::snapshot()
```

which will update `packrat/packrat.lock`. You can also check if you have introduced new dependencies with

```r
packrat::status()
```
If you remove libraries from use that were managed by `packrat` you can check this with

```r
packrat::status()
```

and then remove them from `packrat.lock` with

```r
packrat::snapshot()
```

to ensure that the minimal environment necessary for reproducibility is kept. Checking the status should now show

```r
packrat::status()
#> Up to date.
```
If you have a `packrat.lock` file for an environment that doesn't already exist on your machine, you can build the environment by running (from the command line)

```
R -e 'packrat::restore()'
```
This is one way in which you could set up the same `packrat` environment on a different machine from the Git repository.
## Using papermill with Jupyter
Papermill is a tool for parameterizing, executing, and analyzing Jupyter Notebooks.
This means that you can use papermill to externally run, manipulate, and test Jupyter notebooks. This allows you to use Jupyter notebooks as components of an automated data analysis pipeline or for procedurally testing variations.
- To use papermill with Jupyter notebooks running in the IRkernel, install the R bindings for papermill
A toy example of how to use papermill is demonstrated in the example Jupyter notebook.
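As a sketch of what parameterized execution looks like from Python (the notebook names and parameters below are placeholders for illustration, not files from this repository, and the call is guarded so the sketch degrades gracefully if papermill or the notebook is missing):

```python
# Hypothetical parameterized run of an R notebook with papermill.
# "input.ipynb", "output.ipynb", and the parameter names are
# placeholders, not files from this repository.
parameters = {"n_samples": 1000, "seed": 42}

try:
    import papermill as pm

    pm.execute_notebook(
        "input.ipynb",          # source notebook (placeholder name)
        "output.ipynb",         # executed copy, with outputs saved
        parameters=parameters,  # injected into the cell tagged "parameters"
        kernel_name="ir",       # execute with the R kernel
    )
except Exception as err:  # papermill missing, or the placeholder notebook absent
    print("skipping execution:", err)
```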
## Testing Jupyter notebooks with pytest

To provide testing for Jupyter notebooks we can use pytest in combination with papermill.
Once you have installed pytest and done some minimal reading of the docs, create a `tests` directory and write your test files in Python inside of it. An example of some very simple tests using papermill is provided in `tests/test_notebooks.py`. Once you have read through and understood what the testing file is doing, execute the tests from the top level directory of the repo by running

```
pytest
```
To see the output that the individual testing functions would normally print to `stdout`, run with the `-s` flag:

```
pytest -s
```
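For orientation, here is a minimal, self-contained sketch of the kind of structural checks such a test file might make. The notebook content is inlined so the example runs anywhere; a real test would instead load an `.ipynb` file from disk with `json.load` (or execute it first with papermill):

```python
import json

# A minimal notebook document, inlined so the example is self-contained.
NOTEBOOK_JSON = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"kernelspec": {"name": "ir", "display_name": "R"}},
    "cells": [
        {
            "cell_type": "code",
            "source": "x <- 1 + 1\nx",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
        }
    ],
})


def test_notebook_uses_r_kernel():
    # The notebook should declare the IRkernel so Binder/Jupyter run it as R.
    nb = json.loads(NOTEBOOK_JSON)
    assert nb["metadata"]["kernelspec"]["name"] == "ir"


def test_notebook_has_code_cells():
    # A notebook with no code cells is probably a broken export.
    nb = json.loads(NOTEBOOK_JSON)
    assert any(cell["cell_type"] == "code" for cell in nb["cells"])
```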
### Why write tests?
There are numerous reasons to test your code, but as a scientist an obvious one is ensuring the reproducibility of experimental results. If your analysis code has unit tests and the analysis itself exists in an automatically testable pipeline, then you and your colleagues can have more confidence in your analysis. Additionally, your analysis becomes (by necessity) a well-documented and transparent process.
### Why test with pytest?
pytest is the most comprehensive and scalable testing framework that I know of. I am biased, but I continue to be impressed with how nimble, powerful, and easy it is to work with. It makes me want to write tests. For the purposes of this demo repository it is also important as it allows for writing tests that use papermill (papermill's
execute_notebook is only accessible through the Python API).
There are testing frameworks in R, most notably testthat, which I assume are good, so I would encourage you to explore those as well.
## Automating testing alongside development with CI
Assuming that you're using Git to develop your analysis code then you can have a continuous integration service (such as Travis CI or CircleCI) automatically test your code in a fresh environment every time you push to GitHub. Testing with CI is a great way to know that your analysis code is working exactly as expected in a reproducible environment from installation all the way through execution as you develop, revise, and improve it. To see the output of the build/install and testing of this repo in Travis click on the build status badge at the top of the
README.
To start off with I would recommend using Travis CI (it is the easiest to get up and running).
- Getting started with Travis CI
- The `.travis.yml` in this repo
- Travis CI docs on writing YAML CI files for R
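For orientation, a minimal Travis CI configuration for an R project might look roughly like the following (an illustrative sketch, not the actual `.travis.yml` used in this repo):

```yaml
# Illustrative sketch only; see this repo's actual .travis.yml.
language: r        # use Travis CI's native R support
cache: packages    # cache installed R packages between builds
script:
  - R -e 'packrat::restore()'  # rebuild the packrat environment
```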
### Access restrictions and hosting
There may be instances where you want to have your Git repository be private until work is complete or other information is made publicly available, and you still want to be able to use CI services.
Travis CI (currently) only works with GitHub and is free only for public repositories. CircleCI works with any Git web hosting service (e.g., GitHub, GitLab, Bitbucket) and allows for free use with public and private repositories up to a monthly use time budget. Additionally, GitLab offers their own CI service that is integrated into the GitLab platform. If your organization self-hosts an instance of GitLab (GitLab is open core) then you can use those CI tools with your private GitLab hosted repositories. If your organization has access to the enterprise version of GitLab then you can even run GitLab CI on GitHub hosted repositories.
## Setting up a Binder environment
Binder turns your GitHub repository into a fully interactive computational environment (as you hopefully have already seen from the demo notebook). It then allows people to run any code that exists in the repository from their web browser without having to install any code and is a great tool for collaboration and sharing results.
The Binder team has done amazing work to make "Binderizing" a GitHub repository as simple as possible. To get an R computing environment, often all that you need (in addition to a `DESCRIPTION` file and maybe an `install.R`) is a `runtime.txt` file that dictates which daily snapshot of MRAN to use. See the `binder` directory for an example of what is needed to get this repository to run in Binder.
- Specifying an R environment with a runtime.txt file
- Binder FAQs
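For example, a `binder/runtime.txt` that pins the MRAN snapshot to a particular date contains a single line of the form (the date here is arbitrary):

```
r-2018-02-05
```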
You'll note that the "launch Binder" badge at the top of the README automatically launches into the `R-in-Jupyter-Example.ipynb` notebook. This was configured to do so, but the default Binder behavior is to launch the Jupyter server and then show the directory structure of the repository.
Once the server loads click on any file to open it in an editor or as a Jupyter notebook.
## Preservation and DOI with Zenodo

To further make your analysis code more robust you can preserve it and make it citable by getting a DOI for the project repository with Zenodo. Activating version tracking on your GitHub repository with Zenodo will allow it to automatically freeze a version of the repository with each new version tag and then archive it. Additionally, Zenodo will create a DOI for your project and versioned DOIs for the project releases which can be added as a DOI badge. This makes it trivial for others to cite your work and allows you to indicate what version of your code was used in any publications.
## R Markdown in Jupyter with jupytext

R Markdown is a very popular way to present beautifully rendered R alongside Markdown in different forms of documents. However, it is source only and not dynamically interactive, as the R and Markdown need to be rendered together with Pandoc (Pandoc is awesome).
jupytext is a utility to open and run R Markdown notebooks in Jupyter and save Jupyter notebooks as R Markdown.
Once you have installed jupytext, create a Jupyter config with

```
jupyter notebook --generate-config
```

which creates the config file at `~/.jupyter/jupyter_notebook_config.py`.
Add the following line to the Jupyter config:

```python
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
```
If you now launch a Jupyter notebook server and open a `.Rmd` file, the R Markdown should be rendered in the interactive environment of Jupyter!
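jupytext also exposes a small Python API for converting files directly; here is a guarded sketch (it reuses the `Example_Rmd.Rmd` file name from this repository as a placeholder, and the import and I/O are wrapped so the sketch degrades gracefully if jupytext or the file is missing):

```python
# Sketch of round-tripping between R Markdown and .ipynb with jupytext.
converted = False
try:
    import jupytext

    notebook = jupytext.read("Example_Rmd.Rmd")    # parse R Markdown into a notebook object
    jupytext.write(notebook, "Example_Rmd.ipynb")  # write it back out as a Jupyter notebook
    converted = True
except Exception as err:  # jupytext not installed, or the file is absent
    print("skipping conversion:", err)
```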
### R Markdown in Jupyter in Binder
To get R Markdown working in Binder simply create a
requirements.txt file in the
binder directory and add
jupytext to it. Binder should take care of the rest!
- Here's a minimal example using the `Example_Rmd.Rmd` file from this repository
## Further Reading and Resources
- Jupyter And R Markdown: Notebooks With R, by Karlijn Willems
- Rocker's R configurations for Docker repo
- Noam Ross, "Docker for the UseR", New York Open Statistical Programming Meetup (nyhackr) (July 11th, 2018)