Replication is the cornerstone of science. There is a reproducibility crisis in many areas of science (more refs here and here ...).
To reproduce a study, we need access to the data and to the code that processed and analysed it. But other factors also contribute to the crisis: poor data analysis, misused or misunderstood statistics, and pressure to publish.
D. Knuth: 'Literate programming is a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language.'
J. Claerbout, as paraphrased by D. Donoho: 'An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.'
R. Gentleman and D. Temple Lang, Statistical Analyses and Reproducible Research (2004): 'We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data, ...), and as a means for distributing, managing and updating the collection.'
(There is some ambiguity when it comes to nomenclature. Here, I'll use the one that is most common in biology/bioinformatics, based on, among others, V. Stodden and R. Peng - see references.)
A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data.
A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions.
A study might be reproducible (and this might not be an easy thing to achieve, in particular for large-scale studies), but this does not make it replicable. The latter is stronger than the former. Replicability requires new samples and new data (*), which introduces new variability and additional risks of errors. Reproducibility is, to some extent, a technical challenge, while replication gives the results scientific validity.
(*) in particular biological replicates - see statistics lecture on experimental designs.
Applying/adapting the software or methodology to a new but related question.
Other terms: mechanical reproducibility, statistical reproducibility, repeat/repeatability, ...
- Reproducible/replicable science is better science. And it's even better if science is openly shared, and reproducible/replicable by others.
But also
Five selfish reasons to work reproducibly
- Reproducibility helps to avoid disaster
- Reproducibility makes it easier to write papers
- Reproducibility helps reviewers see it your way
- Reproducibility enables continuity of your work
- Reproducibility helps to build your reputation
There is a need for a culture of reproducibility. It is up to you to establish one for yourselves and your work.
- Input: a text document with text and code chunks (that compute and generate tables and figures). Example here.
- Extract the code chunks and execute the code.
- Replace the code chunks by their respective outputs.
- Compile the text document into a final format, such as pdf or html.
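A minimal sketch of such an input document, assuming R Markdown (the file content below is illustrative, not the actual example used in the practical):

````markdown
---
title: "A minimal report"
output: html_document
---

Some narrative text, followed by a code chunk that is executed
and replaced by its output when the document is rendered:

```{r}
x <- rnorm(10)
mean(x)
```
````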
In R, one would use Sweave or the knitr package.
Example:
library("knitr")
knit("report.Rmd") ## produces report.md
library("rmarkdown")
render("report.md", output_format = "pdf_document") ## produces report.pdf
render("report.md", output_format = "html_document") ## produces report.html
or, more directly
render("report.Rmd", output_format = "pdf_document") ## produces report.pdf
(You will get such an example in the statistics practical and will use RStudio.)
Jupyter notebooks: previously known as IPython notebooks; Python, R, Julia, and many more. Interactive, in the browser.
org-mode: specific to Emacs, many languages.
make is an automated build system, designed to avoid costly recomputation. make examines a Makefile, which contains a set of rules describing dependencies among files.
- A rule is run if the target is older than any of its dependencies.
- Older: compared using the files' modification times.
report.md: report.Rmd
Rscript -e "knitr::knit('report.Rmd')"
report.pdf: report.md
Rscript -e "rmarkdown::render('report.md', output_format = 'pdf_document')"
report.html: report.md
Rscript -e "rmarkdown::render('report.md', output_format = 'html_document')"
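The timestamp-based rebuild logic can be seen in action with a toy Makefile (the directory and file names below are made up for illustration, not part of the report example above):

```shell
# Create a minimal Makefile with one rule: out.txt depends on in.txt.
mkdir -p /tmp/make-demo
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > /tmp/make-demo/Makefile
echo "v1" > /tmp/make-demo/in.txt

make -C /tmp/make-demo   # out.txt does not exist yet, so the rule runs
make -C /tmp/make-demo   # target newer than dependency: nothing to be done

sleep 1                  # ensure a distinct modification time
touch /tmp/make-demo/in.txt
make -C /tmp/make-demo   # dependency now newer than target: the rule runs again
```

The same logic is what lets make skip re-rendering report.pdf when report.md has not changed.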
- Source code versioning systems such as git, subversion, hg, ... and their web interfaces (GitHub, Bitbucket) are invaluable to (1) track changes over time, (2) save all intermediate states of your work and (3) enable/facilitate collaborative work.
Figure taken from the report of the symposium, ‘Reproducibility and reliability of biomedical research’, organised by the Academy of Medical Sciences, BBSRC, MRC and Wellcome Trust in April 2015. The full report is available from http://www.acmedsci.ac.uk/researchreproducibility.
The tools described above are a fantastic way to make one's computational research reproducible. But they are generally not enough. As computational researchers, or even wet-lab scientists who rely on some scripting or programming, it is important to follow some best practices to make our work more efficient and more tractable, thus making it easier to reproduce.
Wilson G et al. Best Practices for Scientific Computing (2014).
- Write programs for people, not computers.
- Let the computer do the work.
- Make incremental changes.
- Don't repeat yourself (or others).
- Plan for mistakes.
- Optimize software only after it works correctly.
- Document design and purpose, not mechanics.
- Collaborate.