# Jupyter notebooks for reproducible research

_Alexander Konovalov_

Research Software Group

School of Computer Science

University of St Andrews

## A quote from the classics ...

* **Practices in source code sharing in astrophysics**, L. Shamir et al., Astronomy and Computing, vol. 1, Feb. 2013, 54–58 (https://arxiv.org/abs/1304.6780)
* One of the references is almost 24 centuries old: **On Interpretation** by Aristotle
* _“One of the important advantages of releasing source code is that it allows replication of the results, which is a key concept in science (Aristotle, 350BC).”_

## Your mileage may vary

* Have you ever been frustrated trying to use someone else’s code that is non-trivial to install?
* Have you tried to make the supplementary code for your paper easily accessible to readers?
* This may require non-trivial effort


## Notebook interfaces

* Combine code, results, text and graphics in the same document
* Available kernels: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
* Lower some barriers to reproducibility
* But introduce challenges of their own

# What is a notebook?

In [1]:

    a = 21  # code cell without output

In [2]:

    print(a)  # code cell with output

    21

**This** is a _markdown_ cell. You can use LaTeX too: $21 \times 2 = 42$

In [3]:

    # could contain many lines
    def double(x):
        '''double the argument'''
        return 2*x

In [4]:

    double(a)

    42
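
On disk, a notebook like the one above is a JSON document holding a list of cells. The following is a minimal sketch of that structure, assuming notebook format version 4 and using only the standard library. The helper names `markdown_cell`, `code_cell` and `minimal_notebook` are hypothetical, introduced here for illustration; they are not part of any Jupyter API.

```python
import json


def markdown_cell(text):
    """Return a dict describing a markdown cell (nbformat 4 layout assumed)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """Return a dict describing a code cell that has not been executed yet."""
    return {
        "cell_type": "code",
        "execution_count": None,  # no "In [n]" number until the cell is run
        "metadata": {},
        "outputs": [],            # filled in by the kernel on execution
        "source": code,
    }


def minimal_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = minimal_notebook([
    markdown_cell("**This** is a _markdown_ cell."),
    code_cell("a = 21"),
    code_cell("print(a)"),
])

# Serialise the notebook as it would be stored in an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print(len(notebook["cells"]), "cells")
```

Because the stored file mixes code, text and (after execution) outputs in one JSON document, tools that diff or review notebooks need to be aware of this layout.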

## 10 simple rules

* **Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks** by Adam Rule, Amanda Birmingham, Cristal Zuniga, Ilkay Altintas, Shih-Cheng Huang, Rob Knight, Niema Moshiri, Mai H. Nguyen, Sara Brin Rosenthal, Fernando Pérez, Peter W. Rose 
    * https://doi.org/10.1371/journal.pcbi.1007007
    * Earlier preprint: https://arxiv.org/abs/1810.08055

* The authors consider the barriers, opportunities, challenges and tools for reproducible computational research

## Spaghetti notebooks?

* One study found that only a small fraction of the Jupyter notebooks mentioned in PubMed Central publications can be run without problems such as accessing data, resolving dependencies or running on a different platform

* Another analysis of over a million Jupyter notebooks publicly available on GitHub found that about 25% of them contain no text at all

    - even those with text rarely contained a detailed description of the steps or an interpretation of the results
    
* Are we going from undocumented code to undocumented notebooks?
* We can and should do better!
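
A quick way to check your own notebooks for missing text is to inspect the notebook file directly. The sketch below assumes the nbformat 4 JSON layout (a top-level `cells` list whose entries carry a `cell_type` field); the function names `cell_type_counts` and `has_text` are hypothetical, and this is not the method used in the studies mentioned above.

```python
import json


def cell_type_counts(notebook):
    """Count cells by type in a parsed notebook (nbformat 4 layout assumed)."""
    counts = {}
    for cell in notebook.get("cells", []):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


def has_text(notebook):
    """Return True if at least one markdown cell has non-blank content."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # "source" may be stored as a single string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            return True
    return False


# A tiny in-memory example; a real file would be read with json.load().
example = json.loads("""
{"cells": [
  {"cell_type": "code", "source": ["a = 21"], "outputs": [],
   "metadata": {}, "execution_count": 1},
  {"cell_type": "markdown", "source": ["Doubling a gives 42."],
   "metadata": {}}
 ],
 "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
""")

print(cell_type_counts(example))
print(has_text(example))
```

A check like this only detects whether text is present; it says nothing about whether the text explains the steps or interprets the results.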

## Aspects of notebook development
* Organise and document
* Work with code
* Share

## Organise and document

1. Tell a story, and match it to the audience
2. Document the process, not just the results
3. Get the right balance while splitting code and text between cells

## Work with code
4. Write modular code
5. Document dependencies
6. Use version control
7. Establish a pipeline
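
As one possible illustration of rule 5, a notebook can record the environment it was run in, for example in its first or last cell. This is a minimal sketch using only the standard library (`importlib.metadata` needs Python 3.8 or later). The function name `describe_environment` and the package list are assumptions made for illustration, and this complements rather than replaces a pinned dependency file.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return Python and platform details plus installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


# The package list is illustrative; replace it with what the notebook imports.
info = describe_environment(["pip", "a-package-that-does-not-exist"])
print(info)
```

Printing this information in the notebook keeps a record of the versions next to the results they produced.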

## Share
8. Share and explain your data, not only the results
9. Enable your notebooks to be read, run and explored
10. Advocate for reproducible and open research!

## Some examples on Binder

* Python: https://github.com/alex-konovalov/repro-jupyter

* SageMath: https://github.com/OpenDreamKit/demo-semigroup-representation-theory

* GAP: 
    * https://github.com/OpenDreamKit/gap-demos
    * https://github.com/gap-system/try-gap-in-jupyter

* GAP + Travis CI + Codecov:
    * https://github.com/sukru-yalcinkaya/unipoly
    * https://github.com/rse-standrewscs/gap-binder-template
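
Repositories such as the ones listed above can be opened on Binder through a launch link. The sketch below builds such a link, assuming the public mybinder.org service and its `v2/gh/<owner>/<repo>/<ref>` URL scheme. The function name `binder_url` is hypothetical, and the link only works if the repository contains the configuration files Binder needs.

```python
from urllib.parse import urlparse


def binder_url(github_url, ref="HEAD"):
    """Build a mybinder.org launch link for a public GitHub repository."""
    parts = urlparse(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a https://github.com/<owner>/<repo> URL")
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("URL must contain both owner and repository name")
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    # ref may be a branch, tag or commit; HEAD means the default branch.
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{ref}"


print(binder_url("https://github.com/alex-konovalov/repro-jupyter"))
# prints https://mybinder.org/v2/gh/alex-konovalov/repro-jupyter/HEAD
```

Pointing `ref` at a tag or commit rather than `HEAD` gives readers the exact version of the code that accompanies a paper.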