# Jupyter notebook guidelines

_Jesús Fernández_ · Instituto de Física de Cantabria (IFCA), CSIC-Universidad de Cantabria

These are guidelines for writing Jupyter notebooks and other code in a common style, to improve readability and reuse.

## Notebooks vs scripts

 * **Notebooks** provide an environment to combine rich text (formatting, links, formulas), programming code and the output generated by the code (figures, tables, ...)
 * **Scripts** are pure code that allow plain text comments and produce output in separate files or on the terminal
 * If you are not willing to write rich text explanations for your code, you had better start with a script rather than a notebook.
     * It can always be converted later into a notebook
     * Easier to develop and test
     * Much lighter for the version control system (git)
 * In any case:
     * Write the text, comments and code in (proper) English
     * Take your time to follow these guidelines
         * The best way to waste your time (and that of others) is to do things "the fast way".

## JupyterLab tips and tricks

 * Learn the shortcuts. You will retain the most useful ones
     * [List](https://discourse.jupyter.org/t/most-useful-keyboard-shortcuts-for-notebook-lab/18113) of shortcuts
     * Add a new cell below (b) or above (a)
     * Change cell type to markdown (m) or code (y)
     * Split the cell at the cursor position (Ctrl Shift -)
     * Merge cells (Ctrl Shift m)
     * Copy (c) or cut (x) and paste (v) full cells
     * Delete a cell (d d)
     * Undo cell-level operation (z)
 * Use the contextual help (Ctrl i)
     * Inspect variable contents
     * Find function arguments
 * Use the notebook console
     * Assign a shortcut to notebook:run-in-console
 * Use the script console to execute the script (Shift Enter)
 * To download a rendered image -> Shift right-click

## Notebook text

 * Use hierarchical sectioning (`#`, `##`, `###`, ...) even if you don't like the size of the resulting headers.
     * Do not skip section levels
     * Choose meaningful section titles
     * Take time to think of the titles. There is no rush.
     * Do not repeat section titles in the same notebook (anchors are automatically created)
     * Do not use section header markers for styling your text. Use **boldface**, _italics_, or `code` markers.
 * Do not enter comments in the code. Use the notebook rich text features
     * You can refer to the `variables` and their meaning in the text

## Notebook (or any) code

 * Load all required libraries **at the top** of your notebook/code
     * Sort them alphabetically
     * Use standard abbreviations
 

```python
import pandas as pd
import xarray as xr
```

 * Define all required parameters **at the top** of your notebook/code
     * Parameters are just variables that will not change along the code
     * Never reassign a parameter to a different value in the middle of the code
     * Parameterize your code as much as possible
     * Use [papermill](https://papermill.readthedocs.io/en/latest) to sweep parameters
     * For lots of small parameters, use a YAML or JSON config file
 

```python
variable = 'tasmax'
input_path = 'EUR-12/UCAN/ERA5/evaluation/r1i1p1f1'
```
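
For the config-file option above, a minimal sketch using only the standard library could look like this. The file name `parameters.json` and its keys are assumptions made for illustration; the example writes the file itself only so that it is self-contained.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config file; normally it would live next to the notebook.
config_dir = Path(tempfile.mkdtemp())
config_file = config_dir / 'parameters.json'
config_file.write_text(json.dumps({
    'variable': 'tasmax',
    'input_path': 'EUR-12/UCAN/ERA5/evaluation/r1i1p1f1',
}))

# At the top of the notebook: read every parameter once ...
with open(config_file) as config_stream:
    parameters = json.load(config_stream)

# ... and never reassign these names further down.
variable = parameters['variable']
input_path = parameters['input_path']
```

Keeping the parameters in a file means the code does not need to be edited to run it on a different variable or path.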

 * Avoid _magic numbers_ in the middle of your code
     * Compute them from existing variables or assign a new one
 * Use meaningful and complete variable and function names
     * Avoid short names that are easy to forget
         * `urban_th`? Use `urban_threshold`
     * Good variable/function names save a lot of code documentation
         * ```python
         # Domain resolution
         res = 0.11
         ```
         * Takes more space than:
         ```python
         domain_resolution = 0.11
         ```
     * Do not name variables after their _specific_ content, as in:
         * ```python
         data_august = data.sel(month = 8)
         data_madrid = data.sel(city = 'Madrid')
         ```
     * Take time to think of the names. There is no rush.
         * If you think of a better name afterwards, do not hesitate to search (**complete words!**) & replace all
         * ... but try to think of a good name from the start
     * Use the autocomplete feature (Tab) to avoid misspelling variables
 * Avoid overwriting objects with different content, especially in different cells
     * Otherwise, the result depends on the cell execution order

```python
my_data = [2, 17, 5]
# ... several cells below ...
my_data = sorted(my_data)
```
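
The advice above on magic numbers, descriptive names and overwriting can be sketched together as follows. All names and values are hypothetical and only serve as an illustration.

```python
# Hypothetical daily maximum temperatures (degC), for illustration only
daily_tasmax = [31.2, 35.8, 29.4, 36.1, 33.0]

# Named parameter instead of a magic number buried in the code
hot_day_threshold = 35.0

# Derived quantity computed from existing variables, not typed in by hand
number_of_days = len(daily_tasmax)

# New content gets a new name, so the original list is left untouched
sorted_tasmax = sorted(daily_tasmax)
hot_days = [tasmax for tasmax in daily_tasmax if tasmax > hot_day_threshold]
hot_day_fraction = len(hot_days) / number_of_days
```

Since no object is overwritten, every cell gives the same result regardless of how many times, or in which order, it is run.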

 * Use functions
     * They save you from repeating code
     * The moment you repeat similar code twice, consider a function, a new variable, or a loop
     * Even if used just once, functions:
         * hide complexity
         * improve readability and save documentation work
     * Functions should solve simple, focused tasks
     * Combine functions as necessary to accomplish complex tasks
     * Look for existing functions! (load additional packages, ask colleagues)
     * Look for extra arguments of the functions you already use!

```python
figure_name = f'{input_path.split("/")[2]}.pdf'
```

vs.

```python
gcm_name = input_path.split("/")[2]
figure_name = f'{gcm_name}.pdf'
```

vs.

```python
def get_gcm_name(path):
    return path.split('/')[2]

figure_name = f'{get_gcm_name(input_path)}.pdf'
```

vs.

```python
def parse_ESGF_path(path, start_dir='project_id', project='CORDEX-CMIP6'):
    """Map each directory in path to its DRS element name."""
    if project.lower() == 'cordex-cmip6':
        # See https://github.com/WCRP-CORDEX/cordex-cmip6-cv/blob/1a4f10b/CORDEX-CMIP6_DRS.json#L3
        directory_path_template = "<project_id>/<mip_era>/<activity_id>/<domain_id>/<institution_id>/<driving_source_id>/<driving_experiment_id>/<driving_variant_label>/<source_id>/<version_realization>/<frequency>/<variable_id>/<version>"
    elif project.lower() in ['cordex', 'cordex-cmip5']:
        # See 5.3 in https://is-enes-data.github.io/cordex_archive_specifications.pdf
        directory_path_template = "<activity>/<product>/<Domain>/<Institution>/<GCMModelName>/<CMIP5ExperimentName>/<CMIP5EnsembleMember>/<RCMModelName>/<RCMVersionID>/<Frequency>/<VariableName>"
    elif project.lower() in ['cmip6']:
        # See https://github.com/WCRP-CMIP/CMIP6_CVs/blob/192cfc9fa674069d92b5549d7f7eea049441af6e/CMIP6_DRS.json#L5
        directory_path_template = "<mip_era>/<activity_id>/<institution_id>/<source_id>/<experiment_id>/<member_id>/<table_id>/<variable_id>/<grid_label>/<version>"
    else:
        # Fail early: there is no template to parse the path with
        raise ValueError(f'Unknown project: {project}')
    directory_path_template_elements = (directory_path_template
        .replace('<', '')
        .replace('>', '')
        .split('/')
    )
    split_path = path.split('/')
    # The first directory in path corresponds to start_dir in the template
    start_dir_index = directory_path_template_elements.index(start_dir)
    paired_elements = directory_path_template_elements[start_dir_index:start_dir_index+len(split_path)]
    return dict(zip(paired_elements, split_path))

drs_elements = parse_ESGF_path(input_path, start_dir='domain_id', project='cordex-cmip6')
figure_name = f'{drs_elements["driving_source_id"]}.pdf'
```

 * Avoid unnecessarily fancy language features/libraries
     * magrittr (R)
         * `my_data_frame %>% head` vs `head(my_data_frame)`
     * List comprehensions (Python) and `lapply` (R) for simple loops that do not need a list as output
         * use `for`
 * Avoid long lines. Keep them under 80 characters
 * Avoid absolute local paths (e.g. `/lustre/gmeteo/...`)
     * They do not work for others
     * You disclose internal paths
     * Link them to your project folder and use relative paths
     * Consider using rprojroot / pyprojroot
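
The relative-path advice above relies on locating the project folder from a marker. The following is a standard-library sketch of that idea, not the pyprojroot API; the function name `find_project_root`, the `.git` marker and the directory layout are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def find_project_root(start, marker='.git'):
    """Walk up from start until a directory containing marker is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f'No {marker} found in or above {start}')

# Demonstration in a throwaway directory tree (hypothetical layout)
demo_root = Path(tempfile.mkdtemp()).resolve()
(demo_root / '.git').mkdir()
notebook_dir = demo_root / 'notebooks' / 'evaluation'
notebook_dir.mkdir(parents=True)

# Paths are built relative to the project root, wherever it is checked out
project_root = find_project_root(notebook_dir)
figure_path = project_root / 'figures' / 'tasmax.pdf'
```

In a real notebook, `Path.cwd()` would typically take the place of `notebook_dir`.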

## Notebook output

 * Avoid long outputs
     * Redirect them
     * Show just the head
     * Write figures to disk instead of showing them all
 * Write figures to disk even if shown on the notebook
 * Cache the output of long calculations or calculations requiring huge input data 
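
Caching, as suggested in the last item above, can be sketched with the standard library as follows. The helper `cached`, the stand-in `slow_calculation` and the cache file name are hypothetical. JSON is used only to keep the example small; for gridded data a format such as NetCDF would likely be more suitable.

```python
import json
import tempfile
from pathlib import Path

def cached(cache_file, compute):
    """Return the content of cache_file, computing and saving it if missing."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = compute()
    cache_file.write_text(json.dumps(result))
    return result

def slow_calculation():
    # Stand-in for an expensive computation (hypothetical)
    return {'mean': sum(range(1000)) / 1000}

cache_file = Path(tempfile.mkdtemp()) / 'slow_calculation.json'
first_result = cached(cache_file, slow_calculation)   # computed and written
second_result = cached(cache_file, slow_calculation)  # read back from disk
```

Deleting the cache file forces the calculation to run again, which is needed whenever the input data or the code behind it changes.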

 * Try to produce publication-ready figures from the beginning
     * Otherwise, figures can be hard to interpret or misinterpreted
     * Choose appropriate colorbars
         * Use symmetric, diverging colorbars for anomalies
         * Do not use diverging colorbars for absolute values
         * Reverse colorbars as necessary to respect warm/cold and wet/dry impression
     * Choose common, appropriate colorbar ranges
         * Use the same range in similar plots
         * Saturate (burn) extreme-valued areas as necessary to highlight your results
         * Use nonlinear scales (only) if necessary
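
A symmetric range shared by similar anomaly plots, as recommended above, can be computed with a small helper. This is a minimal sketch; the function names and the anomaly values are hypothetical.

```python
def symmetric_limits(values):
    """Return (vmin, vmax) centred on zero and covering all values."""
    largest_magnitude = max(abs(value) for value in values)
    return -largest_magnitude, largest_magnitude

def common_limits(datasets):
    """Return one symmetric range shared by several anomaly datasets."""
    all_values = [value for dataset in datasets for value in dataset]
    return symmetric_limits(all_values)

# Hypothetical temperature anomalies (K), one list per season
anomalies = {
    'DJF': [-1.5, 0.3, 2.0],
    'JJA': [-0.4, 1.1, 3.2],
}

vmin, vmax = common_limits(anomalies.values())
```

The resulting `vmin` and `vmax` could then be passed to every plotting call together with a diverging colormap (for instance `RdBu_r` in matplotlib), so that zero falls on the neutral colour in all panels.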
     

## Notebooks and git

 * Notebooks are plain text files (JSON). Keep them under version control
 * Images are text-encoded and take a lot of space
    * Avoid having many images in your notebook
    * Show low resolution images and save high resolution/vector versions to disk (git-ignored)
 * Always run the full notebook before committing to the repository
    * I.e. cells should be numbered sequentially starting from 1
    * Code might work just because of a particular cell execution order during development
    * Make sure it works as intended when run in full
 * Try to use the same environment for development
    * Prevents unintended small version changes that will show up in your diffs
 * Save your environment along with the code
    * E.g. as an `environment.yml` file for conda
 * Try the jupyterlab-git plugin
    * Nice diff feature
    * Check for unintended changes before committing
    * Commit directly to your repository
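
The check that cells are numbered sequentially starting from 1 can be automated before committing. The sketch below reads the notebook JSON and compares the `execution_count` of its code cells with the expected sequence. The function names are hypothetical, and empty code cells (which are left unnumbered) would need extra handling.

```python
import json

def is_run_in_order(notebook_json):
    """True if the code cells are numbered 1, 2, 3, ... without gaps."""
    notebook = json.loads(notebook_json)
    execution_counts = [
        cell.get('execution_count')
        for cell in notebook['cells']
        if cell['cell_type'] == 'code'
    ]
    expected_counts = list(range(1, len(execution_counts) + 1))
    return execution_counts == expected_counts

def make_notebook(execution_counts):
    # Minimal, hypothetical stand-in for the content of an .ipynb file
    cells = [{'cell_type': 'markdown', 'source': []}]
    for count in execution_counts:
        cells.append({'cell_type': 'code', 'execution_count': count,
                      'source': [], 'outputs': []})
    return json.dumps({'cells': cells, 'nbformat': 4})

clean_run = is_run_in_order(make_notebook([1, 2, 3]))
stale_run = is_run_in_order(make_notebook([5, 2, None]))
```

For a real notebook, the JSON text would be read from the `.ipynb` file instead of being built with `make_notebook`.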