# Jupyter Notebooks

## What are Notebooks?

Jupyter Notebooks are interactive, web-based tools that allow developers and data scientists to combine code, visualizations, and narrative text in a single document. 

They are ideal for data analysis, experimentation, and educational purposes. The notebook format enables users to write and execute code in cells, view outputs immediately, and create a narrative around their work, making it accessible and understandable for both technical and non-technical audiences. This makes Jupyter Notebooks a powerful platform for collaboration, teaching, and sharing insights.

Useful for:
- EDA (exploratory data analysis)
- creating AI model POCs
- anything where you want visibility of every step of the process or want to build up and adjust a process without needing to rerun the whole flow from the start

Less useful for:
- creating production-ready code
- working on code that several people will want to edit at the same time

So, within these bounds, let's look at various ways we can make our Jupyter lives easier and more collaborative when working in a team environment.

## Documentation

Never a developer's favourite thing. Data scientists are marginally better, but can often make a lot of assumptions about people's knowledge of the domain problem being explored.

Note: I am a huge documentation fan, don't come at me with your "self-documenting code" arguments.

Recommendations:
- add a documentation block at the top of the notebook that describes the aims and outcomes of the notebook
- if you're building out a process or model, document any otherwise unstated dependencies (data requirements, model lifecycle requirements, etc.) at the top. Include any:
  - gotchas
  - assumptions
  - work still to complete
- link to any relevant external documentation

## Structuring the Notebook

### Navigating the Notebook
- break the notebook into sections
  - put setup and config at the top, so it's obvious what the dependencies are and what the main setup is
  - use logical sections with markdown headings
- collapse sections to reduce scroll / eye overwhelm
- magic commands such as `%load` or `%pycat` can help to show code that lives elsewhere, without copying it into the notebook by hand
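
As a rough illustration of showing code that lives elsewhere, the sketch below prints the source of a function defined in another module. It uses the standard library's `inspect.getsource` so that it runs as plain Python; `textwrap.shorten` is only a stand-in for one of your own helper functions.

```python
import inspect
import textwrap  # standard-library module, used here purely as an example target

# Inside a notebook, IPython offers shortcuts for the same idea, for example:
#   %pycat path/to/helpers.py     (display a file's contents)
#   %load path/to/helpers.py      (pull a file's contents into the current cell)
#   textwrap.shorten??            (show the source of an object)

# Plain-Python equivalent: fetch the source text of a function defined elsewhere.
source_text = inspect.getsource(textwrap.shorten)
print(source_text)
```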

### External Modules
These are especially useful where notebooks in different projects utilise the same methods, e.g. for reading from a database or applying standardised data manipulations.

- encourages code reuse, so each notebook isn't reinventing the wheel in slightly different ways
- notebooks can use the same modules as production code, so for those methods you can be sure that the research and production code are in sync
- tidies the main flow, so that the notebook contains the code specific to this particular model or analysis
- if you are importing modules that you are changing whilst running the notebook, you can use autoreloading (IPython's `autoreload` extension) to update the notebook as soon as that external module has been re-saved, rather than having to restart everything
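
To make the reloading idea concrete, here is a minimal sketch. In a notebook you would normally enable IPython's `autoreload` extension (shown in the comments); the runnable part uses `importlib.reload` to show the manual equivalent in plain Python. The module name `shared_helpers` and its `clean` function are hypothetical, written to a temporary directory just for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# In a notebook, the usual route is to run these two magics in an early cell:
#   %load_ext autoreload
#   %autoreload 2
# The rest of this sketch shows what reloading achieves, for a single module.

# Create a throwaway module on disk (hypothetical name and contents).
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "shared_helpers.py"
module_path.write_text("def clean(value):\n    return value.strip()\n")

# Make the directory importable and import the module for the first time.
sys.path.insert(0, str(module_dir))
shared_helpers = importlib.import_module("shared_helpers")
first_result = shared_helpers.clean("  abc  ")

# Simulate editing and re-saving the module while the kernel is still running.
module_path.write_text("def clean(value):\n    return value.strip().upper()\n")
importlib.invalidate_caches()
importlib.reload(shared_helpers)  # pick up the new definition without a restart
second_result = shared_helpers.clean("  abc  ")

print(first_result, second_result)
```

With `autoreload` enabled, the reload step happens automatically before each cell runs, so the explicit `importlib.reload` call is not needed.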

## Gotchas
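- state: cells can be run in any order, so the variables held in memory may not match what the notebook would produce when run from top to bottom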
- it's easy to accidentally save secrets, or config that only applies to you
- saving and creating checkpoints. It can be easy to accidentally close a notebook without hitting save, in which case you will lose the latest changes
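
One way to keep secrets and personal config out of the saved notebook is to read them from environment variables rather than typing them into a cell. This is a minimal sketch, not a full secrets-management approach; the `get_setting` helper and the variable names `NOTEBOOK_DATA_DIR` and `EXAMPLE_DB_PASSWORD` are hypothetical.

```python
import os


def get_setting(name, default=None):
    """Read a setting from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook"
        )
    return value


# Personal config: falls back to a shared default instead of a path on your machine.
data_dir = get_setting("NOTEBOOK_DATA_DIR", default="./data")
print(data_dir)

# Secrets: no default, so the value never appears in the notebook itself.
#   db_password = get_setting("EXAMPLE_DB_PASSWORD")
```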

## Code Reviews
- `.ipynb` files are essentially JSON, which makes them _very_ hard to check in a pull/merge request
- for notebooks that need to follow a peer review process for the coding elements, use an extension such as Jupytext so that the changesets are much easier to read
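
To show why the raw JSON is awkward to review, the sketch below builds a tiny hand-written notebook in the nbformat v4 layout and renders it both as JSON and as a plain script. The `to_percent_script` function is a hypothetical, heavily simplified imitation of a percent-style script (`# %%` cell markers); it is not Jupytext's implementation.

```python
import json

# A minimal, hand-written notebook in the nbformat v4 JSON layout.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["3\n"]}
            ],
            "source": ["rows = [1, 2, 3]\n", "print(len(rows))"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}


def to_percent_script(nb):
    """Render cells as a plain script with '# %%' markers (simplified sketch)."""
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown becomes comment lines so the file stays valid Python.
            body = "\n".join(f"# {line}" for line in source.splitlines())
            chunks.append("# %% [markdown]\n" + body)
        elif cell["cell_type"] == "code":
            # Code is kept as-is; outputs and execution counts are dropped.
            chunks.append("# %%\n" + source)
    return "\n\n".join(chunks) + "\n"


raw_json = json.dumps(notebook, indent=1)
script = to_percent_script(notebook)

print(f"JSON lines: {len(raw_json.splitlines())}")
print(f"Script lines: {len(script.splitlines())}")
print(script)
```

A reviewer reading the script form sees only the markdown and code, whereas the JSON form also carries outputs, execution counts, and metadata that change on every run.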

## Saving Cell Output
Sometimes you want to save your notebooks with all the outputs in place, so that they are easily shareable and tell the story that you want. But if the notebooks are going into source control and contain references to customer data, this is often not desired.

This is another thing that Jupytext can help with. You can configure your `.gitignore` to exclude the `.ipynb` files and just include the paired `.py` files, which don't include the output.

Another option is a Git pre-commit hook that strips the output before the notebook is committed.
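
As a sketch of what such a hook might do, the code below clears the outputs and execution counts from a notebook's code cells using only the standard library. The function names are hypothetical, and existing tools such as `nbstripout` cover this more thoroughly; the aim here is just to show the idea.

```python
import copy
import json
from pathlib import Path


def strip_outputs(nb):
    """Return a copy of a notebook dict with code-cell outputs and counters cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def strip_notebook_file(path):
    """Rewrite a .ipynb file in place without its outputs (what a hook would call)."""
    path = Path(path)
    nb = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(json.dumps(strip_outputs(nb), indent=1) + "\n", encoding="utf-8")


# Small in-memory example: one code cell whose output mentions customer data.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["customer: 123\n"]}
            ],
            "source": ["print('customer:', customer_id)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][0], indent=1))
```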

## Getting Code to Production
Things you'll need to think about when porting code to a live production system:
- notebooks usually run off static CSVs, for easily repeatable experiments. A live production system will be running off data that is constantly updating, so it will require a lot of data validation checks to ensure that there is enough workable data to actually run the model
- production systems do not allow you to eyeball the output of a process as it goes along, so you'll need additional validation around the outputs of the code, such as model accuracy tracking and sanity checks on the data (for example, checking that values fall within expected ranges)
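
The two kinds of check described above could look something like the following sketch. Everything specific here is an assumption made for illustration: the field names, the minimum row count, and the idea that predictions are probabilities between 0 and 1.

```python
import math

REQUIRED_FIELDS = ("customer_id", "spend")  # hypothetical field names
MIN_ROWS = 3  # hypothetical minimum amount of workable data


def validate_input(rows):
    """Return the usable rows, or raise if there are too few to run the model."""
    usable = [
        row
        for row in rows
        if all(row.get(field) is not None for field in REQUIRED_FIELDS)
    ]
    if len(usable) < MIN_ROWS:
        raise ValueError(f"Only {len(usable)} usable rows, need at least {MIN_ROWS}")
    return usable


def check_predictions(predictions, low=0.0, high=1.0):
    """Raise if any prediction is not a finite number within [low, high]."""
    bad = [p for p in predictions if not (math.isfinite(p) and low <= p <= high)]
    if bad:
        raise ValueError(f"{len(bad)} predictions outside [{low}, {high}]: {bad}")
    return predictions


# Example run with made-up data: one row is dropped for a missing value.
incoming = [
    {"customer_id": 1, "spend": 10.0},
    {"customer_id": 2, "spend": None},
    {"customer_id": 3, "spend": 5.0},
    {"customer_id": 4, "spend": 7.5},
]
usable_rows = validate_input(incoming)
checked = check_predictions([0.2, 0.9, 0.5])
print(len(usable_rows), checked)
```

In a notebook you would spot these problems by eye; in production the checks have to raise (or log and alert) on their own.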
  