# Jupyter Notebooks

* Interactive computational environment for working with **Ju**lia, **P**ython, **R**, and more (40+ languages)
* Composed of input and output cells that can contain code or text (via Markdown) and produce a range of outputs
* Enables the [literate programming](https://en.wikipedia.org/wiki/Literate_programming) paradigm of text and documentation interleaved with code

**_Pro tip: Use the more modern JupyterLab over the classic Jupyter Notebook_**


**_Bonus pro tip: Get comfortable with Jupyter's keybindings_**

Jupyter also provides a range of useful magic commands ("magics") that extend beyond standard Python features.

In [None]:
# these magics make Jupyter reload imported modules when it detects that their
# contents on disk have changed

%load_ext autoreload
%autoreload 2

## Data: Weekly Payroll Jobs and Wages in Australia

Data provided by the Australian Bureau of Statistics pertaining to weekly numbers of jobs with payroll data. This is based on the Australian Tax Office's single touch payroll data, which is how most businesses report salaries and wages, pay as you go (PAYG) withholding, and superannuation.

[Weekly Jobs and Wages Index for Week ending 27/03/2021.](https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-27-march-2021#data-download)


[Excel file used -- Table 5: Sub-state - Payroll jobs indexes](https://www.abs.gov.au/statistics/labour/earnings-and-work-hours/weekly-payroll-jobs-and-wages-australia/week-ending-27-march-2021/6160055001_DO005.xlsx)

## Pandas Data Wrangling

Pandas is a powerful Swiss Army knife for doing data manipulation and analysis in Python.

The two primary data structures it makes available are the `DataFrame`, for working with two-dimensional tabular data, and the `Series` for working with one-dimensional arrays.
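
As a quick illustration of the two structures, here is a minimal sketch on made-up values (the column names `region` and `index_value` are hypothetical, not taken from the ABS data):

```python
import pandas as pd

# A DataFrame holds two-dimensional, labelled tabular data
# (toy values, not taken from the ABS release)
df = pd.DataFrame(
    {
        "region": ["North", "South", "East"],
        "index_value": [100.0, 98.5, 101.2],
    }
)

# Selecting a single column returns a one-dimensional Series
series = df["index_value"]

print(type(df).__name__, df.shape)
print(type(series).__name__, series.shape)
print(series.mean())
```

Most column-level operations (such as `mean`) work on a `Series`, while operations that relate columns to each other work on the `DataFrame`.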

We'll need to install the `openpyxl` package in order to read Excel files.

In [None]:
# you can run shell commands by prefixing with `!`
! pip install openpyxl

In [None]:
import pandas as pd

df = pd.read_excel("data/6160055001_DO005.xlsx", sheet_name=1)
df.shape

Note that Jupyter displays the value of the last expression in a cell when it is executed.

In [None]:
df.head()

Let's use the `dtale` package to better explore the DataFrame.

In [None]:
import dtale

dtale.show(df)

In [None]:
jobs_raw_df = pd.read_excel(
    "data/6160055001_DO005.xlsx",
    sheet_name=1,  # zero-indexed, so this is the second sheet!
    usecols="A:BO",
    skiprows=5,
    skipfooter=2,
)

jobs_raw_df.head()

This data is in [wide](https://en.wikipedia.org/wiki/Wide_and_narrow_data) format. We need to make it into long (or narrow) form, where each row contains a single record.

We also need to split the first two columns into their codes and names.

In [None]:
# get a DataFrame of all columns except the first two
date_cols_df = jobs_raw_df.iloc[:, 2:]

# split col 1 into codes and names
ste_cols_df = (
    jobs_raw_df[jobs_raw_df.columns[0]]
    .str.split(r"\. ", expand=True)
    .rename(columns={0: "STE_CODE16", 1: "STE_NAME16"})
)

# split col 2 into codes and names
sa4_cols_df = (
    jobs_raw_df[jobs_raw_df.columns[1]]
    .str.split(r"\. ", expand=True)
    .rename(columns={0: "SA4_CODE16", 1: "SA4_NAME16"})
)

# combine the 3 sets of columns into a single DataFrame, and then use the melt method
# to convert it into long format
jobs_df = pd.concat([ste_cols_df, sa4_cols_df, date_cols_df], axis=1).melt(
    id_vars=["STE_CODE16", "STE_NAME16", "SA4_CODE16", "SA4_NAME16"],
    var_name="Date",
    value_name="Index",
)

jobs_df
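
The same split-then-melt pattern may be easier to follow on a tiny table. This is a sketch on made-up values, with hypothetical column names (`State`, `CODE`, `NAME`), not the ABS spreadsheet itself:

```python
import pandas as pd

# Toy wide-format table: one column per date (hypothetical values)
wide_df = pd.DataFrame(
    {
        "State": ["1. New South Wales", "2. Victoria"],
        "2020-01-04": [100.0, 100.0],
        "2020-01-11": [99.1, 98.4],
    }
)

# Split "code. name" into two columns; n=1 splits on the first match only
state_cols = (
    wide_df["State"]
    .str.split(r"\. ", n=1, expand=True)
    .rename(columns={0: "CODE", 1: "NAME"})
)

# Put the split columns next to the date columns, then melt so that each
# row holds a single (state, date, value) record
long_df = pd.concat([state_cols, wide_df.drop(columns="State")], axis=1).melt(
    id_vars=["CODE", "NAME"],
    var_name="Date",
    value_name="Index",
)

print(long_df)
```

Two states and two date columns give four rows in long form: every former column header becomes a value in the `Date` column.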

In [None]:
jobs_df.dtypes
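
If the `Date` column turns out not to have a datetime dtype, it can be converted explicitly. The sketch below uses a toy frame with dates stored as strings, which is an assumption for illustration and may not match what the spreadsheet actually yields:

```python
import pandas as pd

# Toy long-format frame whose dates are stored as plain strings
toy_df = pd.DataFrame(
    {"Date": ["2020-01-04", "2020-01-11"], "Index": [100.0, 99.1]}
)

# Convert the column to a proper datetime64 dtype
toy_df["Date"] = pd.to_datetime(toy_df["Date"])
print(toy_df.dtypes)

# With a datetime64 column, a date string on the right-hand side is parsed
# before the comparison, so filtering by date string works
print(toy_df[toy_df["Date"] == "2020-01-04"])
```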

Let's get the mean index for each state on each date (averaging over the state's SA4 regions):

In [None]:
states_jobs = jobs_df.groupby(["STE_NAME16", "Date"])["Index"].mean()
states_jobs

Now get the mean of the state means for each date:

In [None]:
# Series.mean(level=...) was removed in pandas 2.0; group by the index level instead
country_jobs = states_jobs.groupby(level="Date").mean()
country_jobs
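
The two aggregation steps can be checked by hand on a small example. This is a sketch with made-up numbers, where the state names `A` and `B` and the dates `d1` and `d2` are placeholders:

```python
import pandas as pd

# Two states, two regions per state, two dates (toy values)
toy = pd.DataFrame(
    {
        "STE_NAME16": ["A", "A", "B", "B", "A", "A", "B", "B"],
        "Date": ["d1", "d1", "d1", "d1", "d2", "d2", "d2", "d2"],
        "Index": [100.0, 102.0, 98.0, 100.0, 90.0, 94.0, 96.0, 100.0],
    }
)

# Step 1: mean per (state, date), giving a Series with a two-level index
state_means = toy.groupby(["STE_NAME16", "Date"])["Index"].mean()

# Step 2: mean of the state means for each date, grouping on one index level
country_means = state_means.groupby(level="Date").mean()

print(state_means)
print(country_means)
```

Note that a mean of state means weights every state equally, regardless of how many regions it contains.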

# Visualising with Pandas & Matplotlib

In [None]:
# convert Series from previous cell to DataFrame for easier plotting

country_jobs_df = country_jobs.reset_index()
country_jobs_df

### Use Pandas' API to make Matplotlib plot

We're going to start by using Pandas to plot the wage Index over time.

Pandas uses Matplotlib under the hood for plotting. Pandas' [plotting backend](https://pandas.pydata.org/pandas-docs/dev/user_guide/visualization.html#plotting-backends) feature allows other libraries to be used.

Current plotting backends available:

* [Plotly](https://plotly.com/python/pandas-backend/)
* [Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh)

In [None]:
# tell Jupyter to render Matplotlib
%matplotlib inline

country_jobs_df.plot(x="Date", y="Index", title="Weekly Payroll Jobs and Wages Index");

**_Pro tip: Always title your plots!_**

## Improving Aesthetics of Matplotlib plots

**Approach 1: Use a theme**

In [None]:
import matplotlib.pyplot as plt

plt.style.available

In [None]:
plt.style.use("ggplot")

country_jobs_df.plot(x="Date", y="Index", title="Weekly Payroll Jobs and Wages Index");

**Approach 2: Increase the DPI**

In [None]:
fig = plt.figure(dpi=300, figsize=(15, 5))

country_jobs_df.plot(
    x="Date", y="Index", ax=plt.gca(), title="Weekly Payroll Jobs and Wages Index"
);

### Use ipympl Widget for interactivity

_Note: this requires the ipympl package to be installed_

Let's plot the Index of all states. This is simple with Pandas, but requires our data to be in wide format.

In [None]:
%matplotlib widget

# use the unstack method to get it into wide format

states_jobs.unstack(level=0).plot(
    figsize=(15, 5), title="Weekly Payroll Jobs and Wages Index by State"
);

What if our data were in long (tidy) format?

This is harder to plot with Pandas alone, so it is easier to use a different tool.

In [None]:
states_jobs_df = states_jobs.reset_index()
states_jobs_df
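
To make the difference between the two layouts concrete, here is a sketch on toy values (state names `A` and `B` and dates `d1` and `d2` are placeholders) that goes from long to wide with `unstack` and back to long with `melt`:

```python
import pandas as pd

# Long (tidy) format: one row per (state, date) observation
long_df = pd.DataFrame(
    {
        "STE_NAME16": ["A", "A", "B", "B"],
        "Date": ["d1", "d2", "d1", "d2"],
        "Index": [101.0, 92.0, 99.0, 98.0],
    }
)

# Wide format: one row per date, one column per state
series = long_df.set_index(["STE_NAME16", "Date"])["Index"]
wide_df = series.unstack(level=0)
print(wide_df)

# Back to long format again
tidy_df = wide_df.reset_index().melt(
    id_vars="Date", var_name="STE_NAME16", value_name="Index"
)
print(tidy_df)
```

Pandas' `plot` draws one line per column, which is why it wants the wide layout, whereas the libraries below take the long layout and a column name to group by.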

## Other Plotting Options

### Seaborn

Like Pandas, Seaborn is built on top of Matplotlib. It provides an API that is particularly well-suited for making more scientific/statistically oriented plots.

See the [Seaborn gallery](https://seaborn.pydata.org/examples/index.html) for examples.

In [None]:
import seaborn as sns

plt.figure(figsize=(13, 6))

sns.lineplot(
    data=states_jobs_df,
    x="Date",
    y="Index",
    hue="STE_NAME16",
).set(title="Weekly Payroll Jobs and Wages Index by State");

### Plotly

Plotly is powered by a JavaScript-based visualisation library that gives you interactive plots out of the box.

There is a lower-level Python interface for defining plots as [`Figure` objects](https://plotly.com/python/figure-structure), as well as the more concise [Plotly Express](https://plotly.com/python/plotly-express/) that produces `Figure` objects with much less coding required.

A gallery of examples can be found [here](https://plotly.com/python/).

In [None]:
import plotly.express as px

px.line(
    states_jobs_df,
    x="Date",
    y="Index",
    color="STE_NAME16",
    title="Weekly Payroll Jobs and Wages Index by State",
    width=1200,
    height=500,
)

### Adding Country-level mean

In [None]:
states_and_country_df = pd.concat(
    [states_jobs_df, country_jobs_df.assign(STE_NAME16="AUS")]
)
state_names = list(states_jobs_df["STE_NAME16"].unique())

px.line(
    states_and_country_df,
    x="Date",
    y="Index",
    color="STE_NAME16",
    title="Weekly Payroll Jobs and Wages Index by State",
    color_discrete_map={"AUS": "black"},
    category_orders={"STE_NAME16": ["AUS"] + state_names},
    line_dash="STE_NAME16",
    line_dash_sequence=["dot"] + ["solid" for _state in state_names],
    width=1200,
    height=500,
)

### Other options to consider
* [Altair](https://altair-viz.github.io/) (similar to ggplot, with good interactive API)
* [Holoviews](https://holoviews.org/) (has Matplotlib, Bokeh, and Plotly backends)


<center>
    <img src="/files/img/python_viz_libs.svg" width=900/>
    <h2><i>Which visualisation library to use?</i></h2>
</center>

# Spatial Visualisation

A useful tool for working with spatial data is [Geopandas](https://geopandas.org/).

It provides `GeoSeries` and `GeoDataFrame` data structures, like Pandas' `Series` and `DataFrame`, but for working with Shapely geometry objects:
* Points / Multi-Points
* Lines / Multi-Lines
* Polygons / Multi-Polygons

In [None]:
import geopandas as gpd

gdf = gpd.read_file("data/sa4_2016_aust_shape/SA4_2016_AUST.shp")

gdf

Like Pandas, Geopandas can produce plots using Matplotlib.

## Plotting a Choropleth with Geopandas

Let's plot a spatial heat map (choropleth) of the index by SA4 region.

We'll wrap it up in a function that takes a date as an input parameter, filters the data down to the results for that date, and plots the choropleth.

In [None]:
import contextily as cx


def plot_wage_chloropleth(sa4_gdf, jobs_df, date):
    """Plot a choropleth map of the jobs Index for a given date"""
    # remove records with empty geometry, and filter data to the given date,
    # then join with geo data
    sa4_gdf = sa4_gdf[~sa4_gdf["geometry"].isnull()]
    filtered_df = jobs_df[jobs_df["Date"] == date]
    sa4_gdf = sa4_gdf.merge(filtered_df, on="SA4_CODE16", validate="one_to_one")

    fig, ax = plt.subplots()
    sa4_gdf.plot(
        ax=ax,
        edgecolor="black",
        column="Index",
        vmin=jobs_df["Index"].min(),
        vmax=jobs_df["Index"].max(),
    ).set(title="Australian Jobs and Wages Index")

    # set the basemap tiles with contextily, using the CRS of the frame passed in
    cx.add_basemap(ax, crs=sa4_gdf.crs.to_string(), source=cx.providers.CartoDB.Voyager)
    ax.axis("off")


plot_wage_chloropleth(gdf, jobs_df, "2020-01-04")

**_Pro tip: Move code for producing distinct plots into functions_**

## Plotting a Choropleth with Folium

Folium is a Python wrapper around the Leaflet JavaScript library for producing web-based interactive maps.

In [None]:
import folium


def plot_wage_chloropleth_folium(sa4_gdf, jobs_df, date):
    """Plot a choropleth map of the jobs Index for a given date"""
    # remove records with empty geometry, and filter data to the given date
    sa4_gdf = sa4_gdf[~sa4_gdf["geometry"].isnull()]
    filtered_df = jobs_df[jobs_df["Date"] == date]

    folium_map = folium.Map(location=[-22, 133], zoom_start=5)

    choropleth = folium.Choropleth(
        geo_data=sa4_gdf,
        data=filtered_df,
        columns=["SA4_NAME16", "Index"],
        key_on="feature.properties.SA4_NAME16",
        fill_color="YlGn",
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name="Wage Index",
        highlight=True,
    )
    choropleth.geojson.add_child(
        folium.features.GeoJsonTooltip(["SA4_NAME16"], labels=False)
    )
    choropleth.add_to(folium_map)
    folium.LayerControl().add_to(folium_map)
    return folium_map


plot_wage_chloropleth_folium(gdf, jobs_df, "2020-01-04")

Note that in the above solution, the color map is not being mapped to the ranges of the data from the whole year. This seems to be trickier to do with Folium's choropleth than GeoPandas' choropleth.

The working hypothesis is that for choropleths needing any kind of customisation, the best approach is to use the lower-level `folium.GeoJson` class and set the `style_function` and `highlight_function` parameters.
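
A style function for `folium.GeoJson` is just a callable that takes one GeoJSON feature (a dict) and returns a dict of Leaflet path options. The sketch below builds one in plain Python with a colour scale fixed to a given `vmin`/`vmax`. The helper name `make_style_function`, the two endpoint colours, and the grey fallback are all assumptions chosen for illustration, and it has not been run against the SA4 data:

```python
def make_style_function(index_by_region, vmin, vmax, key="SA4_NAME16"):
    """Build a style function of the kind `folium.GeoJson` accepts.

    `index_by_region` maps a region name to its Index value. The returned
    function takes one GeoJSON feature and returns Leaflet path options.
    """

    def to_hex(value):
        # Clamp to [0, 1], then blend linearly from light yellow to dark green
        t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
        low, high = (255, 255, 204), (0, 104, 55)
        rgb = [round(a + t * (b - a)) for a, b in zip(low, high)]
        return "#{:02x}{:02x}{:02x}".format(*rgb)

    def style_function(feature):
        value = index_by_region.get(feature["properties"][key])
        return {
            # grey for regions that have no value on the selected date
            "fillColor": "#cccccc" if value is None else to_hex(value),
            "color": "black",
            "weight": 0.5,
            "fillOpacity": 0.7,
        }

    return style_function


# Hypothetical values: the scale runs from 90 to 100 regardless of the date shown
style = make_style_function({"Ballarat": 90.0, "Bendigo": 100.0}, vmin=90.0, vmax=100.0)
print(style({"properties": {"SA4_NAME16": "Ballarat"}}))
print(style({"properties": {"SA4_NAME16": "Unknown"}}))
```

In the notebook, the resulting function would be passed as `folium.GeoJson(sa4_gdf, style_function=style)`, with `vmin` and `vmax` taken from the whole of `jobs_df["Index"]` so that colours stay comparable across dates.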

## Other options to consider:

* [ipyleaflet](https://ipyleaflet.readthedocs.io/), for creating Leaflet plots along with interactive ipywidget controls.
* [Geoviews](https://geoviews.org/) (related to the Holoviews project)
* [Plotly](https://plotly.com/python/mapbox-layers/) (some integration with Mapbox requires access token)
* [pydeck](https://deckgl.readthedocs.io/en/latest) (based on [deck.gl](https://deck.gl) and with Jupyter widget support)
* [Datashader](https://datashader.org/) for rendering spatial visualisations of really large datasets

# Working with .ipynb files 

A key difference between R Markdown and Jupyter notebooks is the serialisation format.
* R Markdown saves to Markdown (!)
* Jupyter notebooks save to `.ipynb` format, which is a JSON format containing cell contents as well as outputs and cell metadata

JSON format introduces a few challenges:
* results in horrible diffs (not good for version control and collaboration)
* output in notebooks can make very large files (not good for version control)
* output in notebooks can result in sensitive data being version-controlled
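
To see why outputs end up in version control, it helps to look at the JSON directly. The following sketch uses a small hand-written notebook dict (simplified and hypothetical, not a complete `.ipynb` file) and a helper, `strip_outputs`, that is only an illustration of what output-clearing tools do:

```python
import json

# A minimal, hand-written notebook in the nbformat 4 layout (hypothetical content)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print('secret')"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["secret\n"]}
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook dict with code-cell outputs removed."""
    stripped = json.loads(json.dumps(nb))  # cheap deep copy via JSON
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


clean = strip_outputs(notebook)
print(len(json.dumps(notebook)), "->", len(json.dumps(clean)))
```

Everything a cell printed or displayed lives in its `outputs` list, so it is saved, diffed, and committed along with the code unless it is removed first.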


## Git Hooks to the Rescue! 

Git hooks allow us to register commands that occur on specific git events, like `pre-commit` and `post-checkout`.

[`pre-commit`](https://pre-commit.com/) is a Python library for easily creating custom workflows for your project by registering commands with Git hooks.

### Workflow 1:

Register a `pre-commit` hook that uses the nbconvert tool to automatically strip out all cell output before it is committed to your Git repo:

    jupyter nbconvert --clear-output --inplace my_notebook.ipynb
    

[Jupytext](https://jupytext.readthedocs.io/) is a Python library for two-way synchronisation between .ipynb and Markdown or Python.

### Workflow 2:

* Register your `.ipynb` files to be paired with either Markdown or Python (which retains the code and text as specially fenced or commented blocks).
* Do not track `.ipynb` files with version control (add them to e.g. `.gitignore`)
* Use a [pre-commit hook to automatically convert](https://jupytext.readthedocs.io/en/latest/using-pre-commit.html) your `.ipynb` files to the target format before you commit.  
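
To give a feel for what the paired text file looks like, here is a simplified sketch that renders a notebook dict as a "percent"-style Python script, where `# %%` marks a code cell and `# %% [markdown]` marks a commented-out Markdown cell. The function `to_percent_script` is a hypothetical illustration, not Jupytext's implementation, and it ignores cell metadata:

```python
def to_percent_script(nb):
    """Render a notebook dict as a "percent"-style Python script (simplified)."""
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "code":
            # code is kept as-is; outputs are simply not written out
            chunks.append("# %%\n" + source)
        else:
            # Markdown text is kept as Python comments
            commented = "\n".join(
                ("# " + line).rstrip() for line in source.splitlines()
            )
            chunks.append("# %% [markdown]\n" + commented)
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook
notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some text"]},
        {
            "cell_type": "code",
            "source": ["x = 1\n", "x + 1"],
            "outputs": [{"output_type": "execute_result"}],
        },
    ]
}

script = to_percent_script(notebook)
print(script)
```

Because the text file contains only the cell sources, it diffs cleanly and never carries cell outputs into the repository.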


If you're working with notebooks in an operational context where there is any possibility you will be working with personally identifying (or otherwise sensitive) information, you should _absolutely definitely adopt one of these workflows_.

# Making Documents with Jupyter Notebook

The nbconvert tool that comes with Jupyter allows you to convert to PDF and HTML (as well as other formats).

For repeatable workflows that produce high-quality books and reports, I would recommend looking at [Jupyter Book](https://jupyterbook.org/), which supports a wide range of export targets and has good integration with Jupytext.

# Interactivity

<center>
    <img src="/files/img/reactive.svg" width=900/>
</center>

## ipywidgets Example

In [None]:
from ipywidgets import interact


@interact(date="2020-01-04")
def interactive_wages(date):
    return plot_wage_chloropleth(gdf, jobs_df, date)

In [None]:
date_dropdowns = sorted(set(str(timestamp.date()) for timestamp in jobs_df["Date"]))


@interact(date=date_dropdowns)
def interactive_wages(date):
    return plot_wage_chloropleth(gdf, jobs_df, date)

### Some good libraries to choose from
* [ipywidgets](https://ipywidgets.readthedocs.io/) (and [Voila](https://github.com/voila-dashboards/voila) for deploying as app)
* [Dash](https://plotly.com/dash/)
* [Streamlit](https://streamlit.io/)
* [Panel](https://panel.holoviz.org/)

See my talk on [Python Libraries for Data Apps](https://www.youtube.com/watch?v=jI5zLf9Hvd8&lc) for comparing these.

# Conclusions

<center>
    <img src="/files/img/python_viz_libs.svg" width=900/>
    <h2><i>Which visualisation library to use?</i></h2>
</center>

### Suggestions

* Always try to use one of the high-level libraries where you can
* But it is still important to be able to work with the low-level interface (with the exception of Vega-Lite)
* See my talk on [Python Libraries for Data Apps](https://www.youtube.com/watch?v=jI5zLf9Hvd8&lc), for a guide for choosing between Panel, Dash, Streamlit and Voila
* They're all very capable libraries, so pick one and give it a go!

## Python’s strengths

* Large user base -> lots of resources and support
* Rich ecosystem of libraries (both data and otherwise)
* General-purpose programming language
* Well-suited for engineering for scale
* Lots of good cloud-based deployment options


But at the end of the day, the best tool is the one you can solve your problems with.

Pick one, or both, and dive into it!