A template for bootstrapping reproducible RMarkdown documents for data journalistic purposes.
- Comes with cutting-edge, tried-and-tested packages for efficient data journalism with R, such as the
- Full reproducibility with package snapshots (thanks to the checkpoint package)
- Runs out of the box and in one go; users don't need anything pre-installed (except R and maybe RStudio)
- Automatic deployment of knitted RMarkdown files (and zipped source code) to GitHub pages, see this example
- Code linting according to the tidyverse style guide
- .gitignore which ignores shadow files, access tokens and the like by default
- Working directory is set "automagically" (thanks to @fin)
For more information please see the accompanying blog post.
First, clone the repository and reset its Git history:
git clone https://github.com/grssnbchr/rddj-template.git
cd rddj-template
rm -rf .git
git init
If you have a remote repository, you can add it like so:
git remote add origin https://github.com/user/repo.git
How to run
The main document main.Rmd lies in the folder analysis. This is where most of your code resides.
- Set config variables in the very first chunk, specifically:
  - package_date: This is the historical date of CRAN packages you want to use. Usually, you set this to the current date and leave it be. This way, further executions of the script will always use packages from this very date, ensuring reproducibility.
  - R_version: While specifying a package date is the first step for true reproducibility, you also need to tell people what R version you were using, for the sake of compatibility. For instance, R version 3.5.x probably won't work with packages released before May/June 2018. People who want to reproduce a script that you wrote in 2017, for instance, will have to install R version 3.4.x in order to ensure reproducibility.
  - options(Ncpus = x): People with multi-core machines can get a performance boost by specifying more than one core here. If you don't know the number of cores on your machine, set
- Run the script: The individual R chunks should be run in the interpreter (Code > Run Region > Run All; on Linux/Windows: Ctrl+Alt+R, on Mac: Cmd+Alt+R). Be advised that some packages, like rgdal, need additional third-party libraries installed. Watch out for compiler/installation messages in the R console. Also, you need to have the rstudioapi packages globally installed, e.g. installed via the RStudio package manager. On a Mac, occasional y/n: prompts may show up in the R console during package installation (section "install packages"); just confirm them by pressing y and Enter. Knitting the RMarkdown should not be done with RStudio (see below).
WARNING: It is recommended to restart R (Session > Restart R) when starting from scratch, i.e. use Session > Restart R and Run All Chunks instead of Run All Chunks. If you don't do that, checkpoint will be re-installed in your local .checkpoint folder, or other errors might occur.
- Knitting the RMarkdown: Because of how RStudio and checkpoint work, the use of the "knit" functionality in RStudio is strongly discouraged. It might work, but the preferred way is to use the knit.sh shell script. Execute it in a terminal like so: ./knit.sh. This will make sure the rmarkdown package from the specified package date is used, not the globally installed one. knit.sh knits the script into an HTML document, analysis/main.html. If you get an error saying that Pandoc could not be found, you need to let your terminal know where the pandoc binary resides by adjusting the PATH variable. This holds true for both Linux and macOS. Pandoc comes with RStudio, and the binary usually resides in /Applications/RStudio.app/Contents/MacOS/pandoc respectively. So add the respective directory to your path. Workaround without setting the PATH: running knit.sh in the built-in RStudio terminal (not the R console!) always works because RStudio obviously knows the location of the Pandoc binary. Knitting to PDF is currently not supported.
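The Pandoc lookup described above can be sketched as a small shell helper. This is only an illustration: add_pandoc_to_path is a hypothetical function that is not part of the template, and the two candidate directories are the RStudio locations mentioned in this README, assumed to be directories that contain the pandoc binary.

```shell
#!/bin/sh
# Sketch: make sure a pandoc binary is reachable before running ./knit.sh.
# add_pandoc_to_path is a hypothetical helper, not part of the template.
add_pandoc_to_path() {
  # Nothing to do if pandoc is already on the PATH.
  if command -v pandoc >/dev/null 2>&1; then
    return 0
  fi
  # Try each candidate directory passed as an argument, in order.
  for dir in "$@"; do
    if [ -x "$dir/pandoc" ]; then
      PATH="$PATH:$dir"
      export PATH
      return 0
    fi
  done
  echo "pandoc not found; adjust PATH manually" >&2
  return 1
}

# Candidate directories taken from this README (assumed RStudio install locations).
add_pandoc_to_path /usr/lib/rstudio/bin/pandoc \
  /Applications/RStudio.app/Contents/MacOS/pandoc || true
```

If the helper succeeds, running ./knit.sh in the same shell session should find Pandoc. The PATH change only lasts for that session.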
There are four branches at the moment:
- master: Uses R 3.6.x and packages as of 2020-10-07
- r-3.5: Uses R 3.5.3 and packages as of 2019-03-01
- r-3.4: Uses R 3.4.4 and packages as of 2018-04-01
- r-3.3: Uses R 3.3.3 and packages as of 2017-01-01
Use whichever you want.
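To pick a branch from the R version you have installed, a mapping like the following could be used. This is a minimal sketch: branch_for_r_version is a hypothetical helper, and its mapping simply mirrors the branch list above.

```shell
#!/bin/sh
# Sketch: pick the template branch for a given R version.
# branch_for_r_version is a hypothetical helper; the mapping mirrors the branch list.
branch_for_r_version() {
  case "$1" in
    3.6.*) echo "master" ;;
    3.5.*) echo "r-3.5" ;;
    3.4.*) echo "r-3.4" ;;
    3.3.*) echo "r-3.3" ;;
    *) echo "no branch for R $1" >&2; return 1 ;;
  esac
}

branch_for_r_version 3.5.3   # prints r-3.5
```

Inside a clone of the template, this could be combined with Git, for example: git checkout "$(branch_for_r_version 3.4.4)".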
|branch||Ubuntu 16.04||Ubuntu 18.04||macOS High Sierra||macOS Mojave||Windows 10|
|master (R-3.6.x)||not tested||not tested||not tested||not tested|
- 1: It may be necessary to reinstall the curl package because of libcurl. See https://github.com/grssnbchr/rddj-template/issues/9. Also, the compilation of rgdal fails with GDAL 2.2.x.
- 2: On my setup, devtools could not be installed in one go. First I had to install RTools. Then I had to manually run install.packages("debugme"); install.packages("pkgload"); install.packages("pkgbuild"); install.packages("devtools"), and then it would finally install checkpoint and work smoothly from there. If you encounter any errors while installing devtools, have a close look at them and try to figure out what dependencies are missing, then install them manually.
- 3: There were errors similar to 2. When prompted to choose between binary and source packages, I always typed in "y" and hit Enter. This way it worked for me.
This template uses the checkpoint package by Microsoft for full package reproducibility. With this package, all necessary packages (specified in the Define packages R chunk) are from a certain CRAN snapshot which you can specify in the very same R chunk (package_date). For each package_date, the necessary source and compiled packages will be installed to a local .checkpoint folder that resides in your home directory.
This has two big advantages:
- All packages are from the same CRAN snapshot, i.e. are supposed to play nicely together.
- If you re-run your script two or three years after initial creation, exactly those package versions that were used at that point in time, and that work with the code you wrote back then, are loaded and executed. No more deprecated code pieces and weird-looking
In order to make checkpoint work with knitr, this vignette was adapted (it is now archived).
The downside of checkpoint is that you can only access archived packages from CRAN, i.e. MRAN. As others have pointed out, GitHub repositories don't fit into this system. I don't consider this a big issue, as you can install specific versions (i.e. releases/tags) from GitHub, and as long as the GitHub repository stays alive, you can access these old versions. This is how the checkpoint package itself is installed in this template, by the way:
devtools::install_github("checkpoint", username = "RevolutionAnalytics", ref = "v0.3.2")
A second possible disadvantage is the reliance on Microsoft's snapshot system. Once these snapshots are down, the whole system is futile. I reckon/hope there will be third party mirrors though once the system gets really popular. Update September 2017: Apparently you can roll your own checkpoint server.
Deployment to GitHub pages
The knitted RMarkdown may be deployed to a respective GitHub page. If your repository repo is public, it can then be accessed via https://user.github.io/repo (example: https://grssnbchr.github.io/rddj-template). In order to do that:
- Make sure there are no unstaged changes in your working directory. Either git commit them or git stash them before continuing.
- Make sure you're in the root folder of your project (the one above
- Then locally create a gh-pages branch first, check out master again and run the deploy.sh script in the root folder:
  git checkout -b gh-pages
  git checkout master
  ./deploy.sh
- For further deployments, it is sufficient to re-run
./deploy.sh. Make sure your working directory is clean before that step. If that is not the case, deployment will not work.
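The clean-working-directory requirement can be checked before deploying. The following is a minimal sketch, assuming Git 1.8.5 or newer (for the -C option); worktree_is_clean is a hypothetical helper and not part of deploy.sh. Note that it is slightly stricter than the text above, because untracked files that are not ignored also count as changes.

```shell
#!/bin/sh
# Sketch: refuse to deploy when the working directory has uncommitted changes.
# worktree_is_clean is a hypothetical helper, not part of deploy.sh.
worktree_is_clean() {
  # `git status --porcelain` prints one line per changed or untracked file
  # and nothing when the working directory is clean.
  # Return 2 if the directory is not a Git repository at all.
  status=$(git -C "${1:-.}" status --porcelain 2>/dev/null) || return 2
  [ -z "$status" ]
}

if worktree_is_clean .; then
  echo "clean: safe to run ./deploy.sh"
else
  echo "not clean (or not a Git repository): commit or stash changes first" >&2
fi
```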
deploy.sh does the following:
pandoc. If that does not work, modify your PATH variable like so: export PATH="$PATH:/usr/lib/rstudio/bin/pandoc" (tested on Linux).
index.html so it can be rendered by GitHub pages.
scripts into a zipped folder rscript.zip so the repo can be easily downloaded by people who don't understand Git.
- Push everything to your remote gh-pages branch (will be created if not existing).
- GitHub now builds the page and it should soon be accessible via
Linting / styleguide
Code is automatically linted with lintr, i.e. checked for good style and syntax errors according to the tidyverse style guide. When being knitted, the lintr output is at the very end of the document. When being interpreted, the lintr output appears in a new Markers pane at the bottom of RStudio. If you want to disable linting, just comment out that last line in
Other stuff / more features
Versioning of input and output
output files are not ignored by default. This has the advantage that output can be monitored for change when (subtle) details of the R code are changed.
If you want to ignore (big) input or output files, put them into the respective ignore folders. GitHub only allows a maximum file size of 100 MB as of summer 2017.
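To spot files that would exceed that limit before committing, something like the following could be used. This is a sketch under assumptions: find_large_files is a hypothetical helper, the folder names input and output are assumed and may differ in your project, and the M size suffix is a GNU/BSD extension of find rather than plain POSIX.

```shell
#!/bin/sh
# Sketch: list files that exceed a size limit before committing them.
# find_large_files is a hypothetical helper; the 100 MB default follows the
# GitHub limit mentioned in this README.
find_large_files() {
  dir=$1
  limit_mb=${2:-100}
  # -size +NM matches regular files larger than N mebibytes.
  find "$dir" -type f -size +"${limit_mb}"M
}

# Check the (assumed) input and output folders, skipping any that do not exist.
for d in input output; do
  if [ -d "$d" ]; then
    find_large_files "$d"
  fi
done
```

Any path printed by the loop is a candidate for the respective ignore folder.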
Ability to outsource code to script files
If you want to keep your main.Rmd as tidy and brief as possible, you can put separate functions and other code into script files that reside in the scripts folder. An example of this is provided in
Multiple CPU cores for faster package installation
By default, more than one core is used for package installation, which significantly speeds up the process.
Optimal RStudio settings
It is recommended to disable workspace saving in RStudio, see https://mran.microsoft.com/documents/rro/reproducibility/doc-research/
Installation of older R versions
The idea of this template is that you specify your currently used R version, and that people trying to reproduce your scripts will use that very same R version (or at least one that matches in the first two version numbers, e.g. 3.4.x). This makes it necessary to install old R versions. Here's some advice on how to do that on a couple of OSes.
Debian (tested on Ubuntu 16.04 and higher)
- Download the required archive from here
- Untar and move it to /opt/src: sudo tar -xvf R-x.y.z.tar.gz -C /opt/src; this will create a new directory
- Change into that new directory and run sudo ./configure --enable-R-shlib --with-cairo=yes --prefix=/opt/R/R-x.y.z (change placeholders!)
- Install some graphics dependencies, if not already done: sudo apt-get install libcairo2-dev libgtk2.0-dev libtiff5-dev libx11-dev
- Compile it with
- Optionally run sudo make check
- Install it with sudo make install
- There should be an executable binary in
- In order to let your system know of that new R version and to be able to switch between alternatives, do this:
  - Run update-alternatives --list R to see whether R is already registered with alternative versions
  - If not, make a default alternative: sudo rm -rf /usr/bin/R && sudo update-alternatives --install /usr/bin/R R /usr/lib/R/bin/R 1000 (this is probably the newest R version from the Debian package management system)
  - Add the newly installed R version as an alternative: sudo update-alternatives --install /usr/bin/R R /opt/R/R-x.y.z/bin/R 100
  - Check with update-alternatives --display R.
- From now on, you can easily switch between R versions with sudo update-alternatives --config R. Do this before you start RStudio (RStudio always uses the symlink in /usr/bin/R). If there's a problem with a "broken" group, you can "force" the switch with the --force flag right after
- If the update-alternatives switch does not work for some reason, manually set a link with sudo ln -sf /opt/R/R-x.y.z/bin/R /usr/bin/R to switch to version
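After switching, it can help to confirm which R version is actually active. The sketch below assumes the usual first line of R --version output ("R version x.y.z (date) ...") and three-part version numbers; both helpers are hypothetical, and 3.4.4 is only an example target.

```shell
#!/bin/sh
# Sketch: confirm which R version is active after switching alternatives.
# Both helpers are hypothetical and not part of the template.

# Extract "x.y.z" from the first line of `R --version` output read on stdin.
r_version_from_banner() {
  sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Succeed when two three-part versions share the first two components,
# e.g. 3.4.1 and 3.4.4.
same_r_series() {
  [ "${1%.*}" = "${2%.*}" ]
}

# Only run the check when an R binary is on the PATH.
if command -v R >/dev/null 2>&1; then
  active=$(R --version | r_version_from_banner)
  if same_r_series "$active" "3.4.4"; then
    echo "R $active matches the 3.4.x series"
  else
    echo "R $active is active; expected 3.4.x" >&2
  fi
fi
```

The series comparison reflects the idea stated earlier that matching the first two version numbers is usually sufficient for reproduction.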
macOS X (tested on High Sierra and higher)
- First of all, you need to have at least one R version installed (probably the latest one).
- Navigate to r.research.att.com and download/install the so-called RSwitch GUI.
- Download the patched versions of the branch you want to install (earliest available branch is 3.3.) under this section.
- Extract the downloaded *.tar.gz file and move the folder
- Launch "RSwitch GUI" and switch between R versions (change is effective immediately, no need to restart RStudio, only R).
- Install all desired R binaries directly from r-project.org.
- RStudio (tested with 1.1.463) has a very convenient switch for R versions that can be found under Tools > Global Options > General > R version. After switching, restart RStudio.