# Notebook Etiquette

## Introduction

Why a session on notebook etiquette? Because it is important, really important! Knowing when to use which tool in your programming arsenal makes the difference between completing a task in 1 day and taking 5 days with things still being wonky.

<img src='https://www.dataquest.io/wp-content/uploads/2019/01/interface-screenshot.png' width='60%'>

In this notebook we'll look at some common mistakes I've seen people make in notebooks, and I'll point out some sneaky, cool and sometimes very useful add-ons that you can use in your notebooks.

<img src='https://code.visualstudio.com/assets/updates/1_37/icons.gif' width='60%'>

IDEs and editors like Atom, VS Code and Sublime Text are great, and they have their place, don't get me wrong. Similarly, notebooks are great, but they also have their place, along with pros and cons for certain tasks.

The most important difference, for me personally, is that notebooks should be _more_ than code, much more! If you just want to run your code from top to bottom, then use a script - seriously, you'll be better off. No more forgetting which cells ran first or having to deal with variables holding random values.

Notebooks are more than scripts! And they should be treated as such. Think of a notebook as a report or an essay, or even a piece of art, if you like. Either way, your notebooks are a piece of you that you put into the world and you should be proud of them. That's difficult to do if it's just code cells that barely run top to bottom...

You might be thinking, but I just want to code and get the job done. And I hear you. However, I would like to convince you that spicing up your notebooks doesn't need to take any _more_ time than you're currently spending on them. How? Easily, you know those times when you are waiting for data from a DB - scroll up to the top of your notebook and **start refactoring!**

Not only are these great small exercises in refactoring code, but your notebook also keeps improving every time your queries start running. In other words, no extra time is spent on your notebook, but its quality has increased tremendously.

## Naming and structure 

### Naming

There are no perfect solutions here, but the idea is for you and your team to choose a format **and stick to it!**

A good suggestion I've encountered, and have standardised my workflow around, is naming notebooks in the format `date_user_name`. For example, this notebook's name is:

<br>
<div style="text-align: center">
21-01-2020_louwrens_notebook-etiquette.ipynb
</div>
<br>

There are a few reasons for this naming convention - especially if you're working in teams. Originally, I tried to keep track of which notebook follows on which by including a number beforehand, as in: `1_name1.ipynb` and `2_name2.ipynb`. As soon as I started working in a team and people were naming 3 notebooks starting with `1_`, it wasn't a viable option anymore. With the above format, the progression of notebook development is captured by the date, and along with the user it is easy to trace back which notebook follows on which.

Another great advantage of standardising not only notebooks, but all your team's files in this format, is that you can filter them easily in the CLI. For example, if I wanted all the notebooks for January I could pipe a listing of my notebook folder to grep. I've created an `example_notebooks` directory with 4 files:

In [1]:
ll example_notebooks/

total 0
-rw-r--r--  1 louwjlab  staff  0 Jan 21 14:11 03-01-2020_louwrens_a-notebook.ipynb
-rw-r--r--  1 louwjlab  staff  0 Jan 21 14:11 10-01-2020_ruan_another-notebook.ipynb
-rw-r--r--  1 louwjlab  staff  0 Jan 21 14:11 21-12-2019_louwrens_another.ipynb
-rw-r--r--  1 louwjlab  staff  0 Jan 21 14:11 29-12-2019_ongama_yet-another.ipynb


We could filter these notebooks for notebooks only written by louwrens:

In [2]:
!ls -ltra example_notebooks/ | grep --color=auto louwrens

-rw-r--r--  1 louwjlab  staff    0 Jan 21 14:11 21-12-2019_louwrens_another.ipynb
-rw-r--r--  1 louwjlab  staff    0 Jan 21 14:11 03-01-2020_louwrens_a-notebook.ipynb
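
The same filtering can be done from Python. Here is a minimal sketch, assuming every file follows the `date_user_name` format with a `dd-mm-yyyy` date; `NotebookName` and `parse_notebook_name` are hypothetical names used only for illustration:

```python
from datetime import date, datetime
from pathlib import Path
from typing import NamedTuple


class NotebookName(NamedTuple):
    """The three parts of a `date_user_name` notebook file name."""
    created: date
    user: str
    title: str


def parse_notebook_name(filename: str) -> NotebookName:
    # Split on the first two underscores only, so the title may contain more
    raw_date, user, title = Path(filename).stem.split("_", 2)
    created = datetime.strptime(raw_date, "%d-%m-%Y").date()
    return NotebookName(created, user, title)


filenames = [
    "03-01-2020_louwrens_a-notebook.ipynb",
    "10-01-2020_ruan_another-notebook.ipynb",
    "21-12-2019_louwrens_another.ipynb",
    "29-12-2019_ongama_yet-another.ipynb",
]

# Tuples sort on their first field, so this orders the notebooks by date
parsed = sorted(parse_notebook_name(name) for name in filenames)
by_louwrens = [nb.title for nb in parsed if nb.user == "louwrens"]
print(by_louwrens)
```

One thing this makes visible: with a `dd-mm-yyyy` prefix, a plain alphabetical listing is not chronological, so sorting needs the parsed date.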


### Structure

Again, there is no perfect structure. But formalising one for yourself helps! At least for the things that you'll always be doing, like importing, setting constants, etc.

Currently I'm using the following structure:

```
# Title
    ## Introduction
    ## Setup
        ### Imports
        ### Config
        ### Variables
    ## Getting Data
    
    ## The meat, cause you never know where it's going
    
    ## Export/Checkpoint
    ## Conclusion 
```

If you aren't using Jupyter notebook extensions, I highly suggest you do. You can install them by following the instructions in their GitHub repo: https://github.com/ipython-contrib/jupyter_contrib_nbextensions

<img src='https://raw.githubusercontent.com/Jupyter-contrib/jupyter_nbextensions_configurator/master/src/jupyter_nbextensions_configurator/static/nbextensions_configurator/icon.png' width='80%'>

Two of my favorite extensions are `Table of Contents` and `Collapsible Headings`. If you give each heading of a section its own markdown cell, then you can collapse these sections using `Collapsible Headings`, and with the `Table of Contents` extension navigating your notebook is a breeze. For example:

### Section A

Words 

#### Sub Section A

Blah

#### Sub Section B

Blah

#### Sub Section C

Blah

### Section B

More words

### Final Section

Words

## How to use markdown effectively

Did you know you can use **any** HTML in notebooks? It's literally a website. If you want to embed an image of a cat, use an `<img>` tag:

```html
<img src='https://www.animeoutline.com/wp-content/uploads/2018/06/anime_cat_drawing.png' width='30%'>
```

<img src='https://www.animeoutline.com/wp-content/uploads/2018/06/anime_cat_drawing.png' width='30%'>

Oooo 😳, that's some nice HTML syntax highlighting... Remember, when you are in a markdown cell and you use the 3 backticks (\`\`\`) for a block of code, you can state which language it is right after the opening three ticks. For example, `html`:

````    
```html
<img src='https://www.animeoutline.com/wp-content/uploads/2018/06/anime_cat_drawing.png' width='30%'>
```
````             

renders as:

```html
<img src='https://www.animeoutline.com/wp-content/uploads/2018/06/anime_cat_drawing.png' width='30%'>
```

and, 

````    
```python
import numpy as np 
print(np.pi)

def a_function(arg1):
    pass
```
````             

renders as:

```python
import numpy as np 
print(np.pi)

def a_function(arg1):
    pass
```

Another common thing that will set your notebooks apart is using markdown tables, instead of leaving your pandas dataframe hanging - more on this later.

Markdown tables have the basic format:

```
|Normal Header| Right Aligned Header  | Centered Header |
|-------------| --------------------: | :-------------: |
|Normal Cell  | Right Cell            | Centered Cell   |
|Normal Cell  | Right Cell            | Centered Cell   |
```

Colons can be used to align columns.


|Normal Header| Right Aligned Header  | Centered Header |
|-------------| --------------------: | :-------------: |
|Normal Cell  | Right Cell            | Centered Cell  |
|Normal Cell  | Right Cell            | Centered Cell  |


There must be at least 3 dashes separating each header cell.
The outer pipes (|) are optional, and you don't need to make the 
raw Markdown line up prettily. You can also use inline Markdown.

Markdown | Less | Pretty
--- | --- | ---
*Still* | `renders` | **nicely**
1 | 2 | 3

You'll try to make your own markdown table, and hate it! That's because you don't write a markdown table by hand, you generate it using https://www.tablesgenerator.com/markdown_tables or convert a CSV to a markdown table using https://www.convertcsv.com/csv-to-markdown.htm. Or you can just use an HTML table.
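
You can also generate the table from Python and paste the result into a markdown cell. Below is a minimal sketch using only the standard library; `to_markdown_table` is a hypothetical helper, not part of any package. If you are on a recent pandas with the `tabulate` package installed, `DataFrame.to_markdown()` should do the same job for a dataframe.

```python
def to_markdown_table(headers, rows):
    """Build a markdown table string from column headers and row tuples."""
    def fmt(cells):
        return "| " + " | ".join(str(cell) for cell in cells) + " |"

    # Header row, then the separator row (at least 3 dashes per column)
    lines = [fmt(headers), fmt(["---"] * len(headers))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


table = to_markdown_table(
    ["TIME", "TYPE", "COUNT1"],
    [(0, "red", 0), (1, "blue", 2)],
)
print(table)
```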

The main value of markdown tables is also their downfall: their content doesn't change when your data changes. Although this might sound like a downside, sometimes it is actually favourable, as you start aiming to write more deterministic notebooks. These are notebooks where the output is constant no matter who runs them. Saving the dataset you're using in your notebook to a pickle is highly advised - this will also stop you waiting to query data every time. Although we've seen that waiting isn't that bad either, as it improves your notebooks...

For more markdown hints, check out the cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet.

## Variables

We've all been there. You run a cell in a notebook and the output is not at all what it should be, because we've re-run so many other cells that the values have been mixed up and we need to run everything again. This wastes a lot of time you could be spending doing something else, like improving your notebook... 😉

So how do you get around this? Place all _varying_ variables at the top in the `Setup` section and only overwrite variables in the notebook if you _really_ have to; creating new variables is ideal... Then again, don't create new variables for everything - find a balance.

Personally, I don't mind writing some complex pandas pipeline and assigning the result to a new DataFrame, but take a step back and think about what makes logical sense to group in each pipeline.

A good example of this is when you have a datetime window. What I suggest you do is declare 2 variables at the top of your notebook, `start_time` and `end_time`, and save the output of your DB query using these variables, for example:

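```python
from pathlib import Path

import pandas as pd

start_time = '2020-01-02 00:00:00'
end_time = '2020-01-05 00:00:00'

sql = f'''
...
WHERE DATETIME >= '{start_time}'
AND DATETIME <= '{end_time}'
'''

# The time window is part of the file name, so a new window triggers a new query
filename = Path(f'./data/output_{start_time}_{end_time}.pck')

if filename.exists():
    print('file exists, reading in')
    df = pd.read_pickle(filename)
else:
    print('file not found, querying')
    # query_db is your own helper that runs the SQL and returns a DataFrame
    df = query_db(sql)
    df.to_pickle(filename)
```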

## Passwords

You have 2 choices here, but I'll advocate using the latter. You can use the `getpass` module to ask for your password when you run your notebook, but then you lose the ability to just hit `Kernel > Restart & Run All`, which should be a go-to for your notebook, btw.

In [3]:
from getpass import getpass

In [4]:
db_user = 'pyman'
db_password = getpass(f'Password for: {db_user}')

Password for: pyman········


In [5]:
db_user, db_password

('pyman', 'asdf')

Rather, use the `configparser` module and store usernames and passwords in a config file. Just remember to place the config file in your `.gitignore`, otherwise it defeats the purpose of not pushing passwords to git.

In [6]:
from configparser import ConfigParser
from pathlib import Path

In [7]:
config_file = Path('config.ini')
assert config_file.exists(), f"{config_file} doesn't exist."
config = ConfigParser()
config.read(config_file)

config['OPTIMA']['USER'], config['OPTIMA']['PASSWORD']

("'pyman'", "'@sdF'")
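
Note the extra quotes in the output above: `configparser` does not strip quotes, so values in the config file should be written without them. Here is a small sketch of what such a file could look like. The contents are an assumption (only the `OPTIMA` section and its two keys come from the cell above), and it is read from a string here just to keep the example self-contained:

```python
from configparser import ConfigParser

# Hypothetical contents of config.ini; the password is a placeholder
EXAMPLE_INI = """
[OPTIMA]
USER = pyman
PASSWORD = not-a-real-password
QUOTED_USER = 'pyman'
"""

config = ConfigParser()
config.read_string(EXAMPLE_INI)

user = config["OPTIMA"]["USER"]
password = config["OPTIMA"]["PASSWORD"]
quoted_user = config["OPTIMA"]["QUOTED_USER"]

# Unquoted values come back clean; quoted values keep their quotes
print(repr(user), repr(quoted_user))
```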

## When to stop a notebook?

When it has reached its purpose, as set out in the Introduction. The same reasoning applies to writing a notebook as to writing functions: write the docstring first! How do you know when your function is done if you don't know what it should do? For example, instead of writing

```python
def calc_area(a,b):
    return a*b
```

and jumping straight into coding, first define the purpose of the function:

```python
def calc_area(a,b):
    """
    Function that calculates the area of a and b if a and b are the edges of a rectangle.
    
    Parameters
    a (int): First side of rectangle
    b (int): Second side of rectangle
    """
```

Instantly we've gotten our thoughts together and realised that `calc_area` is maybe a bad name; perhaps `calc_area_rect` or `calc_area(a,b,type='rect')` could work better. But the idea is that if you define what your function must do before you write it, you don't just mindlessly jump in and start tinkering. Rather, you know where you're going and, more importantly, when you've arrived.
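
With the docstring in place, filling in the body is the easy part. This is one possible finished version, a sketch rather than the only right answer; the `Returns` and `Raises` sections and the negative-side check are my own additions:

```python
def calc_area_rect(a, b):
    """
    Function that calculates the area of a rectangle with edges a and b.

    Parameters
    a (int or float): First side of rectangle
    b (int or float): Second side of rectangle

    Returns
    int or float: The area, a multiplied by b

    Raises
    ValueError: If either side is negative
    """
    if a < 0 or b < 0:
        raise ValueError("Sides of a rectangle can't be negative.")
    return a * b


print(calc_area_rect(3, 4))
```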

Similarly, with a notebook, state in the Introduction what you would like to achieve with this notebook before you start writing it. That way, you know when you're done.

It is then advised, especially if you are continuing the analysis on the same dataset, that you make a checkpoint dataframe, something like `./data/21-01-2020_louwrens_notebook-etiquette_df.pck`. That way you can just read in this pickle in your next notebook, and you don't have to rerun all the code in this notebook. But, if you would like to rerun the analysis for a different period, then you can rerun both notebooks in order.

**Remember!** Add all `./data` directories and all pickle extensions (`.pck`, `.pkl`, etc.) to `.gitignore`. Why? Let me explain:

### Overview of how git works

When you `commit` a file with `git` it **hashes** all the contents of that file, so a file with content:

###### myfile.txt
```
a
b
c
```

might get hashed into something like: `asdfa393asd9fasd9asdfb9asd9f`. 

This will _always_ be the hash for this exact content. If you change something in the file, for example:

###### myfile.txt
```
a
b
d
```

The hash will change to something completely different, like: `21243kjl1laksjdf3lkj232`.
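
You can see this for yourself with a few lines of Python. Git stores a file's contents as a "blob", and the blob's ID is the SHA-1 of a small header (`blob`, the size in bytes, a null byte) followed by the contents. The sketch below reproduces that; `git_blob_hash` is a name I made up, and the real hashes are 40 hex characters, unlike the made-up ones above:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the ID git would give a blob with these contents."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


before = git_blob_hash(b"a\nb\nc\n")
after = git_blob_hash(b"a\nb\nd\n")

# Same content always gives the same hash; one changed character changes it
print(before)
print(after)
print(before == git_blob_hash(b"a\nb\nc\n"), before == after)
```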

Now, the problem with pickles is that they are binary files, so instead of clear text we have:


###### afile.pck
```
��-�pandas.core.frame��	DataFrame���)��}�(�_data��pandas.core.internals.managers��
                                                                                          BlockManager���)��(]�(�pandas.core.indexes.base��
_new_Index���h
              �Index���}�(�data��numpy.core.multiarray��
                                                        _reconstruct����numpy��ndarray���K��Cb���R�(KK��h�dtype����O8�KK��R�(K�|�NNNJ����J����K?t�b�]�(�SEQ_FR��CLIP_FR��METHOD�IN_POINT��	OUT_POINT�et�b�n�andas.core.indexes.numeric��
Int64Index���}�(hhhK��h��R�(KM��h�i8�KK��R�(K�<�NNNJ����J����Kt�b��t�bh+Nu��R�e]�(hhK��h��R�(KKM��h�f8�KK��R�(Kh9NNNJ����J����Kt�b�Bfffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@fffff�7@8@8@8@8@8@8@8@8@8@8@8@9@9@9@9@9@9@9@9@9@9@9@9@fffff�=@fffff�=@fffff�=@fffff�=@fffff�=@fffff�=@fffff�=@fffff�=@ff

```

Now this in itself is no problem, but if this pickle is `1GB`, it will get hashed and compressed to, say, about `300MB`. So now your git repo has increased in size by `300MB` - keeping in mind that the files with code are `<1MB`. If you change this file the problem just amplifies: it'll get a new hash, and now a new compressed version also needs to be tracked, so your git repo is now `>600MB`.

This problem extends to notebooks as well - and I believe the **ONLY** way to push a notebook to git is with all of the cells' output cleared. Especially if you've got big plotly graphs.
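
Clearing the output can be done from the Jupyter menu, but since an `.ipynb` file is just JSON it is also easy to script. Below is a minimal sketch, assuming the version 4 notebook format where each code cell has an `outputs` list and an `execution_count`; `strip_outputs` is a hypothetical helper and the tiny notebook is made up for the example:

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with all code cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A made-up two-cell notebook; normally you would json.load() an .ipynb file
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cleaned = strip_outputs(notebook)
is_smaller = len(json.dumps(cleaned)) < len(json.dumps(notebook))
print(is_smaller)
```

To apply this to a real file you would `json.load` it, call `strip_outputs` and `json.dump` the result back, ideally before every commit.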

## Writing notebooks in a way that helps you move them to a script

Whenever you can, start abstracting pieces of code into functions. Usually a good rule of thumb is to convert each cell's code into a function, if you've written your cells to be little stand-alone pieces of code. For example:

In [8]:
import pandas as pd

In [9]:
df = pd.DataFrame({'TIME':list(range(0,10)),
                   'TYPE':['red', 'blue']*5,
                  'COUNT1':list(range(0,20))[0:20:2],
                  'COUNT2':list(range(0,100))[0:100:10]})
df

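|   | TIME | TYPE | COUNT1 | COUNT2 |
|---|------|------|--------|--------|
| 0 | 0 | red | 0 | 0 |
| 1 | 1 | blue | 2 | 10 |
| 2 | 2 | red | 4 | 20 |
| 3 | 3 | blue | 6 | 30 |
| 4 | 4 | red | 8 | 40 |
| 5 | 5 | blue | 10 | 50 |
| 6 | 6 | red | 12 | 60 |
| 7 | 7 | blue | 14 | 70 |
| 8 | 8 | red | 16 | 80 |
| 9 | 9 | blue | 18 | 90 |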


In [10]:
df.melt(id_vars=['TIME', 'TYPE'],
       var_name='CATEGORY',
       value_name='COUNT')

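|   | TIME | TYPE | CATEGORY | COUNT |
|---|------|------|----------|-------|
| 0 | 0 | red | COUNT1 | 0 |
| 1 | 1 | blue | COUNT1 | 2 |
| 2 | 2 | red | COUNT1 | 4 |
| 3 | 3 | blue | COUNT1 | 6 |
| 4 | 4 | red | COUNT1 | 8 |
| 5 | 5 | blue | COUNT1 | 10 |
| 6 | 6 | red | COUNT1 | 12 |
| 7 | 7 | blue | COUNT1 | 14 |
| 8 | 8 | red | COUNT1 | 16 |
| 9 | 9 | blue | COUNT1 | 18 |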


Now we'll refactor these into functions:

In [11]:
def create_data():
    """
    Function that creates dummy data set
    
    Returns
    pd.DataFrame()
    """
    df = pd.DataFrame({'TIME':list(range(0,10)),
                   'TYPE':['red', 'blue']*5,
                  'COUNT1':list(range(0,20))[0:20:2],
                  'COUNT2':list(range(0,100))[0:100:10]})
    return df

In [12]:
def melt_df(df):
    """
    Function that melts df and sets the columns headings.
    
    Parameters
    df (pd.DataFrame)
    
    Returns
    df (pd.DataFrame)
    """
    df = df.melt(id_vars=['TIME', 'TYPE'],
                 var_name='CATEGORY',
                 value_name='COUNT')
    
    return df

In [13]:
df = create_data()
df = melt_df(df)
df.head()

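|   | TIME | TYPE | CATEGORY | COUNT |
|---|------|------|----------|-------|
| 0 | 0 | red | COUNT1 | 0 |
| 1 | 1 | blue | COUNT1 | 2 |
| 2 | 2 | red | COUNT1 | 4 |
| 3 | 3 | blue | COUNT1 | 6 |
| 4 | 4 | red | COUNT1 | 8 |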


By turning these pipelines into functions, you can easily copy and paste them over to your scripts. Which brings us to the next piece of advice, one that I myself forget sometimes: DRY! Don't Repeat Yourself!

Try not to copy and paste functions and classes you've created in a previous notebook into the next. Rather, take the time, make a utils module, paste the code in there and import it into your next notebook. Not only does this encourage you to refactor your code, but it also keeps your notebooks clean and tidy as you continue through a project.
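
As a rough sketch of what that looks like: the module name `nb_utils` and the function in it are made up, and the file is written to a temporary directory only to keep the example self-contained. In a real project you would create the `.py` file by hand next to your notebooks and simply `import` it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for your project folder
project_dir = Path(tempfile.mkdtemp())

# The shared code lives in one module instead of being pasted into every notebook
(project_dir / "nb_utils.py").write_text(
    "def calc_area_rect(a, b):\n"
    '    """Return the area of a rectangle with sides a and b."""\n'
    "    return a * b\n"
)

# Make the folder importable, then import the module as any notebook would
sys.path.insert(0, str(project_dir))
importlib.invalidate_caches()
nb_utils = importlib.import_module("nb_utils")

area = nb_utils.calc_area_rect(3, 4)
print(area)
```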