## What are notebooks?

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.14.0/css/regular.min.css" integrity="sha512-qgtcTTDJDk6Fs9eNeN1tnuQwYjhnrJ8wdTVxJpUTkeafKKP6vprqRx5Sj/rB7Q57hoYDbZdtHR4krNZ/11zONg==" crossorigin="anonymous" />

<div style="font-size: 20px">

<i class="fas fa-align-justify ic"></i> text + <i class="fas fa-code"></i> code + <i class="fas fa-photo-video"></i> outputs = <img src="media/Jupyter_logo.svg" width=20 style="position: relative; top: 5px; display: inline"> notebook<br><br>

<i class="fas fa-align-justify"></i> text + <i class="fas fa-code"></i> code = <img src="media/R_logo.svg" width=25 style="display: inline"> Markdown

</div>

Showcase:
- [Lung Cancer Post-Translational Modification and Gene Expression Regulation](https://nbviewer.jupyter.org/github/MaayanLab/CST_Lung_Cancer_Viz/blob/master/notebooks/CST_Data_Viz.ipynb?flush_cache=true) - heatmaps
- [An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study](https://nbviewer.jupyter.org/github/maayanlab/Zika-RNAseq-Pipeline/blob/master/Zika.ipynb) - plotly
- [Population Genetics in an RNA World](https://nbviewer.jupyter.org/github/gocarli/RNA-Popgen-Notebook/blob/master/Population_Genetics.ipynb) - equations

For more examples from various scientific disciplines, see the [gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks).

## Different kinds of notebooks

### R Markdown/R Notebook (RStudio)

- the two terms are almost identical when used **within RStudio**

- Markdown document with code

- outputs not saved with the notebook

- notebook together with outputs can be exported ("knitted") to a number of different file types:
 - PDF
 - HTML
 - Word document 

- predominantly R, other languages somewhat supported (e.g. Python with or without reticulate)

- tied to RStudio
  - but can be rendered in any Markdown viewer!
  - few other editors support R Markdown

- widgets: htmlwidgets (standalone) and shiny (reactive; dashboard server)

### Jupyter notebooks

- JSON file (needs a dedicated viewer)

- multiple language "kernels":
  - JuPyteR = Julia, Python, R
  - C++, Java, MATLAB, Perl, Scala, Mathematica & 135+ more  

- multi-lingual cells using cell "magics"

- can be exported to:
 - PDF
 - HTML
 - Markdown
 - slides 

- multiple editors (Jupyter Notebook, JupyterLab, VSCode, Spyder, PyCharm, ...)

- widgets: ipywidgets (both interactive and standalone), Voilà (dashboard server)

### Other notebook solutions

- Netflix Polynote: inspired by Jupyter, multi-lingual by default

- Jupyter-based
  - Google Colab,
  - JetBrains Datalore,
  - Microsoft Azure Notebooks,
  - CoCalc  

- Mathematica (Wolfram) Notebooks: predecessor of Jupyter

## Short tour of notebook features

## Resources

- writing reproducible papers using R Markdown: [resulumit.com/blog/rmd-workshop](https://resulumit.com/blog/rmd-workshop/)
- reproducible research with Jupyter: [reproducible-science-curriculum.github.io/workshop-RR-Jupyter](https://reproducible-science-curriculum.github.io/workshop-RR-Jupyter/)

## Searching for a balance

We all (should) want our analysis to be:

- easy to write
- easy to read

What do you find more challenging:

- writing your own code
- reading code written by someone else
- reading your own code
- reading your own code from three years ago

## Balancing challenges in scientific research

How to keep the project:

- easy to maintain?

- easy to understand?

- easy to reproduce?

- on track to meet the deadlines?

## What are notebooks good for?

Best for:

 - <i class="fas fa-search"></i> exploration 

 - <i class="fas fa-book"></i> storytelling 

### Moving code out to a dedicated file - why not?

> "I want the reader to understand how this function makes a decision"

- `?` - "okay, this is not enough, I want to show the actual *code*"

In [15]:
from operations import add

In [21]:
add(2, 5)

7

In [22]:
add(7, 5)

12

In [23]:
from functools import partial

In [24]:
add_five = partial(add, b=5)

In [25]:
add_five(2)

7

In [26]:
add_five(7)

12

- `partial` is your friend (`functools.partial` in Python, `purrr::partial` in R)

## Notebook replicability and biomedical research

### FDA Title 21 CFR Part 11

- standards required from groups submitting applications to FDA

- includes: validation, audit trail (detailed history, timestamps), record retention
 - good audit trail = records every step of the analyst in a tamper-resistant way

- you are likely not bound by it

- commercial software DOES offer 21 CFR compliant audit trail support
 - e.g. SIMCA for metabolomics analyses

### Computational notebooks are not 21 CFR compliant *by default*

- You can execute code cells in any order

- The history of execution (and changes) is not kept with the file

- R Markdown re-runs all code when knitting, so (by default) the export fails if the code raises an error

- Jupyter notebooks can be exported in any state

- Jupyter notebooks assume you are a responsible person; you need to act like one

- You still can use:
  - extensions to add timestamps
  - continuous integration to re-run your notebooks every time
  - a notebook platform built with replicability as a main goal
  - linters, tests and diffs to prevent mistakes in the first place
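
The first point above (cells can be executed in any order) can be checked mechanically, because a Jupyter notebook is a JSON file that stores an `execution_count` for each code cell. The sketch below is a minimal illustration using only the standard library; the function names and the example notebook are hypothetical, and only the fields used here are included. It could run as a continuous integration step.

```python
import json


def execution_counts(notebook_json):
    """Return the execution counts of all code cells, in document order."""
    notebook = json.loads(notebook_json)
    return [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]


def executed_top_to_bottom(notebook_json):
    """True if the code cells are numbered 1, 2, 3, ... from top to bottom."""
    counts = execution_counts(notebook_json)
    return counts == list(range(1, len(counts) + 1))


# Hypothetical notebook in which the second code cell was run before the first
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "execution_count": 2, "source": ["x = 1"]},
        {"cell_type": "code", "execution_count": 1, "source": ["print(x)"]},
    ]
})

print(executed_top_to_bottom(example))  # False: cells were run out of order
```

This check is strict: a notebook with a never-executed code cell (count `null`) also fails, which is usually what you want before sharing results.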

**While common notebook tools are not designed for reproducible research, it is easy to make your own notebooks replicable.**

For example, when using R Markdown, it is good practice to write out the entire session information at the very end of the document. You can do that by calling `sessionInfo()`.
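
For Python notebooks there is no single built-in equivalent of `sessionInfo()`, but a rough approximation can be put together from the standard library. The sketch below is one possible way to do it; `session_info` is a hypothetical helper and the package names passed to it are only examples.

```python
import platform
import sys
from importlib import metadata


def session_info(packages):
    """Describe the interpreter, the operating system and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


# Place in the last cell of the notebook; list the packages you actually use
for key, value in session_info(["pandas", "numpy"]).items():
    print(f"{key}: {value}")
```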

### General FDA notes
- currently only for-profit companies developing closed-source proprietary software invest in full FDA compliance (e.g. SAS)

- providing compliance is difficult for rapidly evolving open-source software
- validation issues somewhat addressed in:
   - huge progress on the R core in May 2018, see: [Regulatory Compliance and Validation Issues: A Guidance Document for the Use of R in Regulated Clinical Trial Environments](https://www.r-project.org/doc/R-FDA.pdf)
     - only core + some chosen packages
     - CRAN 
     - tidyverse? maybe in the future
   - RStudio has a largely similar document: [RStudio: Regulatory Compliance and Validation Issues](https://rstudio.com/wp-content/uploads/2014/06/RStudio-Commercial-IDE-Validation.pdf)   

- audit trail issues... well:

> R is not intended to create, maintain, modify or delete Part 11 relevant records but to perform calculations and draw graphics.
>
> Where R’s use may be interpreted as creating records, however, R can support audit trail creation within the record
>
> R includes `date()`, `Sys.time()`, `Sys.Date()` and `Sys.timezone()` functions  which  enable users to include date and time stamps on report, graphical and other output, thus enabling the use of this information in the tracking of user sessions.

   - Any Jupyter extension that adds timestamps automatically and reliably could be better than that!

## Advice on working with notebooks

- start a notebook by defining its scope with:
   - a description (what is this notebook doing?)
   - list of aims (what are deliverables? any questions to be answered?)
   - (optional) list of non-aims (if needed to distinguish similar notebooks with otherwise overlapping scope)

- separate data preparation, data exploration and the final analyses notebooks

- DRY: if you find yourself writing the same code at the beginning of each notebook:
   - create a file with the repetitive setup commands
   - execute it in the first cell of every notebook:
     - for Jupyter notebook: `%run notebook_setup.ipynb`
     - for R Markdown: `source('init.R')`

## Three tools that may help you

1. Tests
 - check if the code performs as you would expect
 - different from validation: works as expected + expectations are correct (a task for an entire team!)
 - at minimum, prevents you from breaking properly functioning code when trying to *improve* it

### Tests example

In [27]:
def divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return 'Warning: division by zero'

In [28]:
def test_divide():
    assert divide(1, 2) == 0.5
    assert divide(10, 2) == 5


test_divide()

In [29]:
assert divide(1, 0) == 'Warning: division by zero'

2. Linters
  - check the code as you write it
  - can only catch some obvious mistakes and typos (syntax errors)
  - often can teach you good *style* (i.e. the conventions adopted by other language users)
  

3. Diffs
  - newbie mistake: git as a back-up solution
  - comparing changes in code between versions is standard practice
  - what about comparing notebooks? how do you diff them?

**These three alone will not make your code perfect, but they are a good step forward**

Getting your code towards perfection:
- thoughtful design of interfaces,
- domain-specific knowledge of the actual science (your code may be working as expected, but what if the expectations were wrong?)
- in-depth and easy to understand (e.g. hierarchical) documentation

## Tests in notebooks

- It is possible to:
  - write tests in notebooks
  - execute tests in notebooks
  - convert entire notebooks into test suites  

- Should you? Maybe.
  - all of the above is a use case departing from storytelling and research
  - if you write for other biomedical researchers, they want a story to read (minimal code, no tests)
  - everyone else looking specifically for the tests (e.g. a stray research software engineer) will first check the tests directory, rather than open each notebook

- Personal recommendation:
  - do NOT store tests (code written to verify other code) in notebooks
  - DO use assertions frequently in notebooks

Good assertions examples:
- sanity checks, e.g. `assert all(patients.age > 0)`
  - protects you in case the data changes in an unexpected way
  - signals to the reader that the data frame `patients` has a column "age"
- assumption checks (for statistical models, or to prevent duplicate index entries in pandas)
- unique values check e.g. `assert set(patients.smoker) == {'Yes', 'No', 'Unknown'}`
  - a weak assumption which should also hold if new data points are added

In [109]:
from pandas import Series

age = Series([1, 4, 2, 3])

In [31]:
assert all(age > 0)
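
The other kinds of assertions listed above follow the same pattern. The sketch below uses hypothetical patient records stored as plain Python dictionaries so that it runs without pandas; with a data frame, the same checks would be written against its columns and index. Note that the unique values check is written here as a subset test (`<=`) rather than the equality used above, so that it also passes when one category happens to be absent.

```python
# Hypothetical patient records; in practice these would be rows of a data frame
patients = [
    {"id": "P1", "age": 54, "smoker": "Yes"},
    {"id": "P2", "age": 61, "smoker": "No"},
    {"id": "P3", "age": 47, "smoker": "Unknown"},
]

# Sanity check: every age is positive
assert all(p["age"] > 0 for p in patients)

# Unique values check: only the known categories are present
observed = {p["smoker"] for p in patients}
assert observed <= {"Yes", "No", "Unknown"}, f"unexpected values: {observed}"

# Assumption check: patient identifiers are not duplicated
ids = [p["id"] for p in patients]
assert len(ids) == len(set(ids)), "duplicate patient identifiers"
```

Adding a message after the comma costs little and tells the reader (and your future self) which assumption was violated.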

**Exercise**: Write some tests, or write a function so that the tests pass (two versions: R and Python).

## Linters

### Python

General:
- fun fact: `import this`
- PEP 8 (the style guide), flake8, pycodestyle, mypy, ...
- mypy: static type checking

In Jupyter notebooks:
- early days (extension required/limited functionality)
- VS Code
- [jupyterlab-lsp](https://github.com/krassowski/jupyterlab-lsp)

### R

General:
- lintr
- styler

In R Notebooks:
- by default in RStudio

In Jupyter notebooks:
- early days...
- [jupyterlab-lsp](https://github.com/krassowski/jupyterlab-lsp)

**Exercise**: Clean up the provided code to remove all warnings (two versions: R and Python).

## Diffs

Goals:

1. highlight why certain things are better kept out of the notebooks
2. demonstrate how to generate diffs for notebooks

- **Exercise 1**: Note the differences between the two provided notebooks
- **Exercise 2 - optional**: Find the offending commit which introduced a bug, using the blame function on GitHub
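
Because a Jupyter notebook is a JSON file that mixes code, outputs and metadata, a plain `git diff` is often hard to read. One way to see what a notebook-aware diff does is to compare only the source of the code cells. The sketch below is an illustration built on the standard library, not a replacement for dedicated tools such as nbdime; the function names and the two example notebooks are hypothetical.

```python
import difflib
import json


def code_lines(notebook_json):
    """Collect source lines of code cells, ignoring outputs and metadata."""
    notebook = json.loads(notebook_json)
    lines = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The source may be stored as one string or as a list of lines
        text = source if isinstance(source, str) else "".join(source)
        lines.extend(text.splitlines())
    return lines


def diff_notebooks(old_json, new_json):
    """Unified diff of the code in two notebooks, as a list of lines."""
    return list(difflib.unified_diff(
        code_lines(old_json), code_lines(new_json),
        fromfile="old", tofile="new", lineterm="",
    ))


# Two hypothetical versions of a notebook; the outputs differ as well,
# but only the change in the code is reported
old = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 1\n", "print(x)"], "outputs": []},
]})
new = json.dumps({"cells": [
    {"cell_type": "code", "source": ["x = 2\n", "print(x)"], "outputs": ["2"]},
]})

print("\n".join(diff_notebooks(old, new)))
```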

## Publishing notebooks

## Dealing with patient data

- You most likely work with anonymized data to begin with
  - you (of course) keep the data locally, probably password-protected and outside of the repository (or listed in a `.gitignore` file),
  - and you do not print out the entire dataset at once, just to be certain
- But would it be ok if the data became public at any point without proper vetting?
  - anything you push to an online repository is at a higher risk, even if it's private
- Case study: a patient from a minority background + recruited at the other, smaller hospital you work with + after six pregnancies.
  - could this patient flag up as outlier? If yes, why?
     - possibly for any of the three characteristics, let alone the sum of them
  - would this patient be easier to identify than others?
     - how many persons with such characteristics **and** the disease you study live there?
  - if both answers are yes, you may want to make an extra effort to avoid revealing any additional information, e.g. which metabolites they had up/down

### Mitigation strategies
- Proper inspection of diffs is one way to control what is being committed to your repository; even anonymized data should be handled with great care.
- If you want to check for any accidental change in the data without printing it out (or adding it to version control), it might be better to use a checksum, as demonstrated with the MD5 example\*

\*) While checksums are not unique, to get a single collision you would need to generate random data at a rate of ["6 billion per second for 100 years"](https://stackoverflow.com/a/288519). There may be more pressing issues to worry about on such time scales.
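
A checksum of a data file can be computed with the standard library alone. The sketch below is a minimal illustration; `md5_checksum` is a hypothetical helper, and a temporary file stands in for the real data. The idea is to record the checksum once and assert in the notebook that it has not changed, without ever displaying the data.

```python
import hashlib
import os
import tempfile


def md5_checksum(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# A temporary file standing in for the real (private) data file
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write(b"patient,age\nP1,54\n")
    data_path = handle.name

# Record the checksum once and keep it next to the analysis...
expected = md5_checksum(data_path)

# ...then verify it at the top of every notebook that reads the file
assert md5_checksum(data_path) == expected, "the data file has changed"

os.remove(data_path)
```

Only the checksum ends up in the notebook and in version control, so a diff reveals that the data changed without revealing the data itself.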

## Dealing with data

- Passing data around
  - recommendation: do NOT use pickle (Python) or RDS (R)
    - can only be accessed with Python/R
    - neither is backwards compatible by default
    - cannot be diffed with git nor viewed on GitHub
  - do use plain tabular format (CSV/TSV) or any domain-specific format (e.g. VCF/FASTA/PDB):
    - can be opened in any relevant software
    - easy to track changes
- Versioning data
   - do not store data in the same git repository
     - large files make git difficult to work with after some time
     - you may want to keep data more private than the analyses/summaries
   - have a separate repository for data
     - do not push it to an external service provider (GitHub/GitLab etc.)
     - do push it to a dedicated server within your organization
   - or, use artifact versioning
     - [quilt](https://github.com/quiltdata/quilt) - versioned data portal for AWS, petabyte-scale, notebook-oriented
     - [scrapbook](https://github.com/nteract/scrapbook) - records the cell outputs (graphics, tables) separately from the notebook
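
To illustrate the recommendation on plain tabular formats, the sketch below writes a small table as TSV and reads it back using only the standard library. The records and the helper names (`to_tsv`, `from_tsv`) are hypothetical. Values are kept as strings because a plain text format does not store types, which is the trade-off for being readable in any software and diffable line by line.

```python
import csv
import io


def to_tsv(rows, fieldnames):
    """Serialise a list of dictionaries as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text):
    """Parse tab-separated text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Hypothetical measurements
rows = [
    {"sample": "S1", "metabolite": "lactate", "level": "1.8"},
    {"sample": "S2", "metabolite": "lactate", "level": "2.4"},
]

text = to_tsv(rows, ["sample", "metabolite", "level"])
print(text)

# The round trip preserves the content
assert from_tsv(text) == rows
```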