# Keeping clean notebooks

In this notebook you will learn:

* How to use the notebook spellchecker included in `hds_code` to reduce spelling mistakes in your notebook work.
* How to use Jupyter Lab's built-in rulers to keep code clean
* How to break strings over multiple lines and how to call functions with a large number of parameters.
* How to use a package called `nbqa` combined with `flake8` and `black` to lint and autoformat code in a notebook
* How to write PEP257-compliant docstrings

## Imports

> We are not going to use this import, you will see why later!

In [1]:
import numpy as np

## 1. Spell checking markdown

The `hds_code` environment includes a Jupyter Lab extension called `jupyterlab-spellchecker`. We will use the example text in section 1.1 to illustrate how it works.

The text, kindly written by ChatGPT (note the waffly style that is easy to spot!), contains six spelling mistakes:

1. powerfull (should be powerful)
2. envirnment (should be environment)
3. consise (should be concise)
4. mantain (should be maintain)
5. guidlines (should be guidelines)
6. colaboration (should be collaboration)

You will see these highlighted when you edit the markdown below. Right-click on the highlighted word and select **adjust spelling to** for suggestions. Alternatively, select **ignore** if the spelling is as desired.


### 1.1. Finding and fixing spelling mistakes in markdown

Jupyter Lab notebooks are powerfull tools for data scientists and researchers, offering a versatile envirnment for coding, analysis, and documentation. To structure your notebooks effectively, start by creating a clear and consise title that reflects the project's purpose. Organize your content into logical sections, using markdown cells for headings and explanations. It's crucial to mantain a balance between code and narrative, ensuring that your notebook tells a coherent story. Include comments within code cells to clarify complex operations, and consider using collapsible sections for lengthy outputs. Remember to clean and run all cells before sharing your notebook, as this helps prevent confusing results. By following these guidlines and maintaining consistent formatting throughout, you'll create well-structured notebooks that are easy to navigate and understand, enhancing colaboration and reproducibility in your work.

## 2. Line length

In Python we aim to keep line length to 79 (or 80) characters. A line that exceeds this length breaks [PEP8 coding standards](https://peps.python.org/pep-0008/). More concretely, the code becomes harder to read! This is partly due to (old) monitor sizes, but coders also usually have more than one window open at a time.

### 2.1 Adding a ruler (vertical line) for code cells 

There is no real excuse for having extremely long lines of code. Python syntax and all major IDEs make it easy to split lines. In Jupyter, head over to the **Settings** menu and select **Notebook**. Choose to **Add** a ruler for code cells at either 79 or 80 characters.

<img src="images/ruler.png" alt="ruler" style="width: 500px;"/>

### 2.2 Strings and line length

Strings can sometimes become very long. Here is a simple example where **parentheses** and implicit string concatenation are used to break up a string over multiple lines.


In [2]:
# splitting a string over multiple lines using parentheses
msg = (
    "Invalid parameter selections for hospital_id, ward_id and "
    "confidence_int.  Please select values with range provided in the main "
    "manual."
)
print(msg)

Invalid parameter selections for hospital_id, ward_id and confidence_int.  Please select values with range provided in the main manual.
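
For comparison, here is a minimal sketch of two common ways to split a long string: implicit concatenation inside parentheses, and a trailing backslash. The variable names are made up for illustration. Both produce the same string, although PEP8 prefers the parentheses form.

```python
# Option 1: implicit concatenation inside parentheses (no backslash needed)
msg_parentheses = (
    "Invalid parameter selections for hospital_id, ward_id and "
    "confidence_int."
)

# Option 2: a backslash at the end of the line continues the statement
msg_backslash = "Invalid parameter selections for hospital_id, ward_id and " \
    "confidence_int."

# both approaches build exactly the same string
print(msg_parentheses == msg_backslash)
```

Note that adjacent string literals are joined with nothing in between, so remember to include the space at the end of each piece.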


### 2.3 Code and line length

In the data science libraries of Python you will find that functions and classes tend to have a large number of mandatory and optional parameters. To keep your code readable you will need to split your code over multiple lines. For example, here is a dummy function that takes four parameters:

In [3]:
def rolling_forecast_origin(train, min_train_size, horizon=1, step=1):
    """dummy function with lots of parameters"""
    return None


y_train = []
min_train_size = 24
horizon = 6
step = 2

# call the function
cv = rolling_forecast_origin(train=y_train, min_train_size=min_train_size, horizon=horizon, step=step)

You can put each parameter on a separate line (or two per line). Note that the names of the variables are aligned.  See PEP8 for more details.

In [4]:
# alternative call style that maintains line length
cv = rolling_forecast_origin(train=y_train, 
                             min_train_size=min_train_size, 
                             horizon=horizon, 
                             step=step
)
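
Another option that PEP8 allows is a hanging indent, where nothing follows the opening bracket and each argument sits on its own indented line. A minimal sketch, with the dummy function repeated so that the example runs on its own:

```python
def rolling_forecast_origin(train, min_train_size, horizon=1, step=1):
    """dummy function, repeated here so the example is self-contained"""
    return None


y_train = []

# hanging indent: nothing after the opening bracket, one argument per line
cv = rolling_forecast_origin(
    train=y_train,
    min_train_size=24,
    horizon=6,
    step=2,
)
```

With this style the closing bracket lines up with the start of the statement, and the trailing comma makes it easy to add another argument later.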

There may be situations where your function or class exceeds line length guidance because it has a large number of parameters or they have default values. For example, see the `Normal` distribution class below:

In [5]:
class Normal:
    def __init__(self, mean=0.0, sigma=1.0, x_minimum=None, x_maximum=None, random_seed=None):
        pass

To correctly follow [PEP8 guidelines](https://peps.python.org/pep-0008/) you can use a hanging indent with each parameter on its own line. Note that the parameters are indented by 4 spaces (an extra level of indentation) relative to `def`, and that the closing bracket on its own line separates them from the body of the method.

In [6]:
class Normal:
    def __init__(
        self,
        mean=0.0,
        sigma=1.0,
        x_minimum=None,
        x_maximum=None,
        random_seed=None,
    ):
        pass

## 3. Docstrings

If you create a Python function, class, or module then you should provide a docstring to go with it. You can read more about docstrings in [PEP257](https://peps.python.org/pep-0257/).

### 3.1 Docstrings in notebooks

I’ve provided a simple function example below (with the main code omitted). Note the use of triple quotes to open and close the docstring. In this case my docstring consists of three sections:

1. A short description of the singular purpose of the function
2. A description of mandatory and optional parameters (and default values if applicable)
3. Details of the type of variable(s) returned when execution is complete.

> Note that some code uses the heading "Keyword arguments" in place of "Parameters". **Whatever you choose, be consistent.**

Functions and classes in clean notebooks should aim to have a good quality docstring. The complexity of these should be enough to describe the three points above. **But note** that you should also be using markdown and code cells to illustrate how the function/class is used. 

In [7]:
def multiple_replications(rc_period=1440, warm_up=0, n_reps=5):
    """
    Perform multiple replications of a computer simulation model
    of a hospital ward.  Returns results of each replication in tabular
    format.

    Parameters:
    ------

    rc_period: float, optional (default=1440)
        results collection period.
        the number of minutes to run the model beyond warm up
        to collect results

    warm_up: float, optional (default=0)
        initial transient period.  no results are collected in this period

    n_reps: int, optional (default=5)
        Number of independent replications to run.

    Returns:
    --------
    pandas.DataFrame
    """
    pass
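
Once a function has a docstring, you and your users can read it without opening the source code. The sketch below uses a small hypothetical function, `warm_up_end` (it is not part of the simulation model above), and the standard library's `inspect.getdoc` to retrieve its docstring. In a notebook you can also call `help(warm_up_end)` or type `warm_up_end?` in a code cell.

```python
import inspect


def warm_up_end(warm_up=0):
    """
    Return the time at which results collection begins.

    Parameters:
    ------
    warm_up: float, optional (default=0)
        initial transient period.

    Returns:
    --------
    float
    """
    return float(warm_up)


# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(warm_up_end)
print(doc)
```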

### 3.2. A note on more complex docstrings (in modules or packages)

Docstrings can vary in length depending on the complexity of the code and whether you intend it to be reused by others. For example, my `forecast_tools` package includes the following docstring for a function called `auto_naive`. This includes additional sections on:

* Raises - a list of exceptions that can occur when called
* See also - a list of related classes and functions
* Examples - pythonic code to test the function

In [8]:
def auto_naive(
    y_train,
    horizon=1,
    seasonal_period=1,
    min_train_size="auto",
    method="cv",
    step=1,
    window_size="auto",
    metric="mae",
):
    """Automatic selection of the 'best' naive benchmark on a 'single' series
    The selection process uses out-of-sample cv performance.
    By default auto_naive uses cross validation to estimate the mean
    point forecast performance of all naive methods.  It selects the method
    with the lowest point forecast metric on average.
    If there is limited data for training a basic holdout sample could be
    used.
    Dev note: the plan is to update this to work with multiple series.
    It would be best to use MASE for multiple series comparison.

    Parameters:
    ----------
    y_train: array-like
        training data.  typically in a pandas.Series, pandas.DataFrame
        or numpy.ndarray format.
    horizon: int, optional (default=1)
        Forecast horizon.
    seasonal_period: int, optional (default=1)
        Frequency of the data.  E.g. 7 for weekly pattern, 12 for monthly,
        365 for daily.
    min_train_size: int or str, optional (default='auto')
        The size of the initial training set (if method=='ro' or 'sw').
        If 'auto' then min_train_size is set to len(y_train) // 3
        If min_train_size='auto' and method='holdout' then
        min_train_size = len(y_train) - horizon.
    method: str, optional (default='cv')
        out of sample selection method.
        'ro' - rolling forecast origin
        'sw' - sliding window
        'cv' - scores from both ro and sw
        'holdout' - single train/test split
        Methods 'ro' and 'sw' are similar, however, sw has a fixed
        window_size and drops older data from training.
    step: int, optional (default=1)
        The stride/step of the cross-validation. I.e. the number
        of observations to move forward between folds.
    window_size: str or int, optional (default='auto')
        The window_size if using sliding window cross validation
        When 'auto' and method='sw' then
        window_size=len(y_train) // 3
    metric: str, optional (default='mae')
        The metric to measure out of sample accuracy.
        Options: mase, mae, mape, smape, mse, rmse, me.

    Returns:
    --------
    dict
        'model': baseline.Forecast
        f'{metric}': float
        Contains the model and its CV performance.

    Raises:
    -------
    ValueError
        For invalid method, metric, window_size parameters

    See Also:
    --------
    forecast_tools.baseline.Naive1
    forecast_tools.baseline.SNaive
    forecast_tools.baseline.Drift
    forecast_tools.baseline.Average
    forecast_tools.baseline.EnsembleNaive
    forecast_tools.baseline.baseline_estimators
    forecast_tools.model_selection.rolling_forecast_origin
    forecast_tools.model_selection.sliding_window
    forecast_tools.model_selection.mase_cross_validation_score
    forecast_tools.metrics.mean_absolute_scaled_error

    Examples:
    ---------
    Measuring MAE and taking the best method using both
    rolling origin and sliding window cross validation
    of a 56 day forecast.

    ```
    >>> from forecast_tools.datasets import load_emergency_dept
    >>> y_train = load_emergency_dept()
    >>> best = auto_naive(y_train, seasonal_period=7, horizon=56)
    >>> best
    {'model': Average(), 'mae': 19.63791579700355}
    ```

    Take a step of 7 days between cv folds.

    ```
    >>> from forecast_tools.datasets import load_emergency_dept
    >>> y_train = load_emergency_dept()
    >>> best = auto_naive(y_train, seasonal_period=7, horizon=56,
    ...                   step=7)
    >>> best
    {'model': Average(), 'mae': 19.675635558539383}
    ```
    """
    pass
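
Examples written in the `>>>` style can be checked automatically with Python's built-in `doctest` module. Here is a minimal sketch using a hypothetical `mean_absolute_error` function (written for this illustration, not taken from `forecast_tools`). The example in the docstring is collected, run, and compared with the expected output.

```python
import doctest


def mean_absolute_error(y_true, y_pred):
    """
    Return the mean absolute error of a set of point forecasts.

    Examples:
    ---------
    >>> mean_absolute_error([1, 2, 3], [2, 2, 5])
    1.0
    """
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# collect the examples from the docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
tests = finder.find(
    mean_absolute_error,
    name="mean_absolute_error",
    globs={"mean_absolute_error": mean_absolute_error},
)
for test in tests:
    runner.run(test)

# results holds the number of failed and attempted examples
results = runner.summarize(verbose=False)
print(results)
```

If an example in a docstring drifts out of date, the failed count will be greater than zero, which is a handy prompt to update the documentation.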

## 4. Linting and autoformatting notebooks

### 4.1 Linting notebooks

If you are new to coding (or even experienced!) you might be delighted to know that software exists to help you write cleaner code and keep to PEP8 standards. These are called code **linters**.

There are a number of linters you can choose from. Here I make use of **flake8**. I’ve always found this helpful.

Using flake8 with a Jupyter notebook requires another package, `nbqa` ([Quality Assurance for Jupyter Notebooks](https://nbqa.readthedocs.io/en/latest/)). This can be installed with pip.

For example, to run the linter with this particular notebook I would run the following in the terminal:

```bash
nbqa flake8 02_cleaner_notebooks.ipynb
```

I get the following output:

```
cleaner_notebooks.ipynb:cell_1:1:1: F401 'numpy as np' imported but unused
cleaner_notebooks.ipynb:cell_3:12:80: E501 line too long (101 > 79 characters)
cleaner_notebooks.ipynb:cell_3:13:1: E124 closing bracket does not match visual indentation
cleaner_notebooks.ipynb:cell_4:2:44: W291 trailing whitespace
cleaner_notebooks.ipynb:cell_4:3:60: W291 trailing whitespace
cleaner_notebooks.ipynb:cell_4:4:46: W291 trailing whitespace
cleaner_notebooks.ipynb:cell_4:6:1: E124 closing bracket does not match visual indentation
cleaner_notebooks.ipynb:cell_5:2:80: E501 line too long (94 > 79 characters)
cleaner_notebooks.ipynb:cell_6:1:1: F811 redefinition of unused 'Normal' from line 41
```

**Explanation**

The first line can be interpreted as follows:

* `cell_1` - indicates that the code that failed linting is in input cell 1.
* `F401` - the specific linting violation identifier. In this case a module is imported but not used. (You can see the full list of violations [here](https://flake8.pycqa.org/en/latest/user/error-codes.html).)
* `numpy as np` - in this case, the module that is not used in the notebook.

To **resolve** this error we would decide if the `numpy` import should be removed.  Avoid imports that are not needed.

The second line can be interpreted as follows:

* The violation is in input `cell_3`
* The violation is on line 12
* The code exceeds PEP8 line length (101 instead of 79 characters).

> To toggle line numbers on in a notebook: Select the cell. Press ESC. Shift-L.

The line of code in question is: 

```python
cv = rolling_forecast_origin(train=y_train, min_train_size=min_train_size, horizon=horizon, step=step)
```

To **resolve** this error we would use the line formatting option already discussed.
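
To make the `E501` check less mysterious, here is a rough sketch of the idea behind it: count the characters on each line and report any line that is longer than the limit. The function `find_long_lines` is hypothetical and written only for illustration. It is not how flake8 is implemented, and the real linter does far more.

```python
def find_long_lines(source, max_length=79):
    """
    Find lines of code that are longer than a maximum length.

    Parameters:
    ------
    source: str
        The source code to check, e.g. the contents of a code cell.

    max_length: int, optional (default=79)
        The maximum number of characters allowed on a line.

    Returns:
    --------
    list of (int, int) tuples
        The line number (starting from 1) and length of each long line.
    """
    long_lines = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_length:
            long_lines.append((line_no, len(line)))
    return long_lines


# a made-up cell: the second line is deliberately far too long
cell_source = "x = 1\n" + "y = " + "1 + " * 25 + "1\n"
print(find_long_lines(cell_source))
```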

### 4.2 Autoformatting code in notebooks

To help with PEP8 compliance we can make use of a code formatter. In the `hds_code` environment you can combine `nbqa` with `black`. This software will greatly improve the code format, but may not fully comply with PEP8, so it is often useful to run a linter afterwards to check.

It can be run as follows in a terminal:

```bash
nbqa black 02_cleaner_notebooks.ipynb
```

**Caveats**

By default `black` uses a line length of 88 characters, which is longer than 79. We can modify the line length parameter as follows:

```bash
nbqa black 02_cleaner_notebooks.ipynb --line-length=79
```

> Note that black will not break strings for you. If you have long strings in functions/classes/cells I recommend you make use of the approach in section 2.2.