## How to organise a large project? How to keep it reproducible?

- The order of notebook execution matters:
  - for small projects, prefixing notebook names with a number works well
     - `01_Clean_data.ipynb`
     - `02_Explore_covariates.ipynb`
     - `03_Survival_analysis.ipynb`
     - having multiple steps at a single level is fine, just document the convention in the README
       - `03_Survival_analysis.ipynb`
       - `03_Secondary_outcomes.ipynb`
  - for medium-sized projects:
     - hierarchical directories
  - for large projects:
     - automated execution of all notebooks, using either:
       - a maintained, ordered list of notebooks to run, or
       - an order inferred from the input and output files
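A minimal sketch of how the numeric-prefix convention can be turned into an ordered list of notebooks, using only the standard library. The helper name `step_number` is hypothetical; notebooks sharing a prefix end up in the same step.

```python
import re
from itertools import groupby

# Notebook names from the example above, deliberately out of order.
names = [
    "03_Survival_analysis.ipynb",
    "01_Clean_data.ipynb",
    "03_Secondary_outcomes.ipynb",
    "02_Explore_covariates.ipynb",
]


def step_number(name):
    """Extract the numeric prefix (hypothetical helper for this sketch)."""
    match = re.match(r"(\d+)_", name)
    if match is None:
        raise ValueError(f"{name!r} does not start with a numeric prefix")
    return int(match.group(1))


# Sort by step first, then alphabetically within a step.
ordered = sorted(names, key=lambda name: (step_number(name), name))

# Group notebooks that share a prefix into a single step.
steps = {step: list(group) for step, group in groupby(ordered, key=step_number)}
```

Here `steps[3]` holds both `03_*` notebooks, which matches the "multiple steps at a single level" convention.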

## Example of a medium-sized project

![](media/example_project.png)

## Automated execution of large projects

Recommendation (my way):

1. Use an input/output system for data. Examples for Python:
   - [scrapbook](https://github.com/nteract/scrapbook)
   - [data-vault](https://github.com/krassowski/data-vault)
2. Analyse the notebooks for the syntax that loads/saves the inputs/outputs
3. Create a directed acyclic graph to run the notebooks in the optimal order
4. Automate notebook execution in a separate environment
5. Compare the resulting notebooks with your current ones to catch reproducibility issues

This is relatively easy to build using `papermill` to run the Jupyter notebooks.
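A rough sketch of step 3, assuming the inputs and outputs of each notebook have already been extracted (step 2). The file names and the `execution_order` helper are made up for illustration; this is not how nbpipeline is implemented. It relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical result of step 2: files each notebook reads and writes.
# The dictionary is deliberately not in execution order.
notebooks = {
    "03_Survival_analysis.ipynb": {
        "inputs": ["clean.csv", "covariates.csv"],
        "outputs": ["survival.csv"],
    },
    "02_Explore_covariates.ipynb": {
        "inputs": ["clean.csv"],
        "outputs": ["covariates.csv"],
    },
    "01_Clean_data.ipynb": {
        "inputs": ["raw.csv"],
        "outputs": ["clean.csv"],
    },
}


def execution_order(notebooks):
    """Order notebooks so that each runs after the producers of its inputs."""
    # Map every output file to the notebook that creates it.
    producer = {
        output: name
        for name, io in notebooks.items()
        for output in io["outputs"]
    }
    # Each notebook depends on the notebooks producing its inputs;
    # inputs without a producer (e.g. raw data) add no dependency.
    graph = {
        name: {producer[i] for i in io["inputs"] if i in producer}
        for name, io in notebooks.items()
    }
    # Raises graphlib.CycleError if the dependencies are not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(notebooks)
# Each notebook in `order` could then be run with, for example,
# papermill.execute_notebook(input_path, output_path).
```

Writing the executed copies to a separate output path keeps the originals untouched, so the two versions can be compared afterwards (step 5).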

### Example implementation

[nbpipeline](https://github.com/krassowski/nbpipeline) + [data-vault](https://github.com/krassowski/data-vault)

![](media/nbpipeline_example_interactive_result.png)

![](media/nbpipeline_example_diff.png)

## Where to put things?

#### Data in tabular format

- changes are easier to track (even on GitHub)

- easy to analyse in a different program

- try adopting CSV/TSV files to:
  - discourage color coding (difficult to work with and error-prone)
  - encourage one <s>table</s> file = one <s>relation</s> piece of data rule
  - avoid common date/time issues and issues with gene/metabolite names
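A small illustration of the last point, with made-up data: a TSV file read with the standard `csv` module keeps every value as plain text, so gene symbols such as `SEPT2` are never silently converted into dates, and dates are parsed only where you ask for it.

```python
import csv
import io
from datetime import date

# Made-up TSV content for illustration; in practice this would be a file.
tsv_text = "gene\tcollected\nSEPT2\t2020-03-01\nMARCH1\t2020-03-02\n"

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Gene symbols stay exactly as written.
genes = [row["gene"] for row in rows]

# Dates are converted explicitly, from an unambiguous ISO 8601 format.
dates = [date.fromisoformat(row["collected"]) for row in rows]
```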

#### Code in dedicated files

- easy to write unit tests for

- easy to reuse for other projects

- which encourages proper documentation

- can be edited in any editor, including industry-standard IDEs (e.g. PyCharm for Python)

- separates implementation details from the critical research "flow"

- a slide with example coming...
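As a minimal sketch of the first point: once a function lives in its own file, a unit test is only a few lines. The `add` function is the one from `operations.py` used in the following slides, copied here so the example is self-contained; the test file name is hypothetical.

```python
# operations.py (copied here so the sketch is self-contained)
def add(a: int, b: int):
    """Will add a and b, which both should be integers"""
    return a + b


# test_operations.py (hypothetical file name); in a real project this
# file would start with `from operations import add`.
def test_add():
    assert add(2, 5) == 7
    assert add(7, 5) == 12


# A test runner such as pytest would discover and call this automatically.
test_add()
```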

#### Videos and images in a separate media folder

- easier to access from other software

- dedicated tools for comparing images (diffs)

- faster load time of notebooks (on your computer and online)

### Moving code out of the notebook - advice

In [2]:
from operations import add

- to show the usage and documentation use `?`:

In [3]:
add?

Signature: add(a: int, b: int)
Docstring: Will add a and b, which both should be integers
File:      ~/notebooks-for-biomedical-research/operations.py
Type:      function


- to also include the code use `??`:

In [4]:
add??

Signature: add(a: int, b: int)
Source:
def add(a: int, b: int):
    """Will add a and b, which both should be integers"""
    return a + b
File:      ~/notebooks-for-biomedical-research/operations.py
Type:      function


- to display documentation only use `inspect.getdoc`:

In [5]:
from inspect import getdoc

In [6]:
getdoc(add)

'Will add a and b, which both should be integers'

- to display source use `inspect.getsource`:

In [7]:
from inspect import getsource
getsource(add)

'def add(a: int, b: int):\n    """Will add a and b, which both should be integers"""\n    return a + b\n'

> "But my function uses two parameters and a DataFrame that I loaded earlier in the notebook"

- pass them in as function arguments instead

- you get to name the arguments now; use this to explain the intent better!
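A sketch of what this can look like. The function name, argument names and data are invented for illustration, and a plain list of dictionaries stands in for the DataFrame so that the example has no dependencies; the same idea applies to a pandas DataFrame passed as the first argument.

```python
# Instead of reading `patients`, `min_age` and `max_age` from notebook
# globals, the function receives them explicitly.
def select_age_range(patients, min_age, max_age):
    """Return the patients whose age lies within [min_age, max_age]."""
    return [p for p in patients if min_age <= p["age"] <= max_age]


# Made-up data standing in for a table loaded earlier in the notebook.
patients = [
    {"id": "P1", "age": 17},
    {"id": "P2", "age": 42},
    {"id": "P3", "age": 70},
]

# Keyword arguments document the intent at the call site.
adults = select_age_range(patients, min_age=18, max_age=65)
```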

> "But I would need to provide the same argument every time I use this function - will I have to copy-paste?"

In [15]:
from operations import add

In [21]:
add(2, 5)

7

In [22]:
add(7, 5)

12

In [23]:
from functools import partial

In [24]:
add_five = partial(add, b=5)

In [25]:
add_five(2)

7

In [26]:
add_five(7)

12

- `partial` is your friend (`functools.partial` in Python, `purrr::partial` in R)