In [None]:
%load_ext autoreload
%autoreload 2

! mkdir -p data

<center>
    <h1>Doing Data Viz with Python</h1>
    <br><br><br>
    <img src="/files/img/title_viz.png" width="500"/>
    <br>
    <img src="/files/img/python_logo.svg" width="300"/>
    <h2>Ned Letcher</h2>
    <h3>Code: <a href="https://github.com/ned2/python-viz">github.com/ned2/python-viz</a></h3>
</center>

## What is data visualisation?
* Graphic representation of data that visually encodes information
* Reveals patterns, trends, relationships
* Used to discover and communicate insights

<center>
    <h3>Examples of Visualisations</h3>
    <img src="/files/img/plot_types.svg" width="800"/>
</center>

## Why visualise data?
* Summary statistics mislead
* Human visual perception is a powerful tool 

### The Datasaurus

Like [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), the Dinosaur shows us the pitfalls of using summary statistics to understand a dataset.

https://www.autodesk.com/research/publications/same-stats-different-graphs
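
As a rough illustration of the same point without downloading anything, the sketch below hard-codes Anscombe's Quartet and computes a couple of summary statistics using only the standard library. The helper name `pearson` and the `quartet` dictionary are illustrative choices, not part of this notebook's `utils` package.

```python
from statistics import mean

# Anscombe's Quartet: four small datasets that look very different when plotted
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Print mean of x, mean of y, and correlation for each dataset
for name, (xs, ys) in quartet.items():
    print(name, round(mean(xs), 2), round(mean(ys), 2), round(pearson(xs, ys), 2))
```

The means and correlations come out nearly identical across the four datasets, even though their scatter plots have completely different shapes.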

In [None]:
! wget -P data https://raw.githubusercontent.com/jumpingrivers/datasauRus/main/inst/extdata/DatasaurusDozen.tsv

In [None]:
from utils.datasaurus import make_datasaurus

make_datasaurus();

Let's have a look at some pedestrian traffic for Southern Cross Station for March 2020. 

In [None]:
! ./bin/get-pedestrian-data.sh

In [None]:
from utils.pedestrian import load_and_clean_pedestrian_data

pedestrian_df = load_and_clean_pedestrian_data(
    "data/Pedestrian_Counting_System_-_Monthly__counts_per_hour_.csv"
)

sc_march_2020_df = pedestrian_df[
    (pedestrian_df["Year"] == 2020)
    & (pedestrian_df["Month"] == "March")
    & (pedestrian_df["Sensor_Name"] == "Southern Cross Station")
]

sc_march_2020_df.head(20)

_Tabular representations of datasets are difficult to interpret_

Let's visualise the same data using a line chart:

In [None]:
sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly Counts for Southern Cross Station, March 2020",
);

_**Hot Tip:** Always title your plots!_

### _Visualisations help reveal patterns within data_

Often the most effective way to do things with data:

* describe
* explore
* summarise
* communicate

And sometimes it is more accurate than quantitative approaches...

## Python Data Viz Libraries

There are a _lot_ of Python data viz libraries...

<br>
<center>
    <img src="/files/img/python_viz_landscape.svg" width=800/>
    <h2><i>Which visualisation library to use?</i></h2>
</center>

***
<br>
<center>
    <img src="/files/img/python_viz_libs.svg" width=1000/>
    <h2><i>A framework for comparing general purpose Python Viz Libraries</i></h2>
</center>

## Matplotlib and Pandas

Pandas' `plot` method defaults to using Matplotlib.

Other Pandas [plotting backends](https://pandas.pydata.org/pandas-docs/dev/user_guide/visualization.html#plotting-backends) currently available are:
* [Plotly](https://plotly.com/python/pandas-backend/)
* [hvPlot](https://hvplot.holoviz.org/user_guide/Pandas_API.html)
* [Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh)
* [Altair](https://github.com/altair-viz/altair_pandas)
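
A minimal sketch of how the backend is selected, assuming only pandas and Matplotlib are installed. The `demo_df` frame is a hypothetical stand-in for the pedestrian data; the `backend` keyword and the `plotting.backend` option are part of pandas itself.

```python
import pandas as pd

# Hypothetical stand-in for the pedestrian data used in this notebook
demo_df = pd.DataFrame(
    {
        "Date_Time": pd.date_range("2020-03-01", periods=24, freq="h"),
        "Hourly_Counts": range(24),
    }
)

# The global default backend is stored in a pandas option
default_backend = pd.get_option("plotting.backend")
print(default_backend)

# The backend can also be chosen per call; swap "matplotlib" for e.g. "plotly"
# if that library is installed
ax = demo_df.plot(x="Date_Time", y="Hourly_Counts", backend="matplotlib")
```

Setting `pd.options.plotting.backend = "plotly"` would change the default for every subsequent `plot` call instead.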

In [None]:
import matplotlib as mpl
%matplotlib inline

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
);

**Limitations**
1. Doesn't look very pretty out of the box
2. Pandas plotting API is limited
3. Static image: can’t zoom or toggle visibility of data

### Improving the Aesthetics of Static Plots

Suggestions:
* Use an alternative style
* Increase the plot DPI

In [None]:
import matplotlib.pyplot as plt

plt.style.available

In [None]:
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before Matplotlib 3.6

fig = plt.figure(dpi=300, figsize=(15, 5))

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    legend=None,
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
    ax=plt.gca(),  # supply Pandas with the axes from the current figure
);

### Use `ipympl` for basic interactivity

In [None]:
%matplotlib widget

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
);

### For more statistical use-cases, see if Seaborn has what you need

[Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)

#### Clustermap

In [None]:
%matplotlib inline

import pandas as pd
import seaborn as sns

sns.set_theme()


def make_clustermap():
    # Load the brain networks example dataset
    df = sns.load_dataset("brain_networks", header=[0, 1, 2], index_col=0)

    # Select a subset of the networks
    used_networks = [1, 5, 6, 7, 8, 12, 13, 17]
    used_columns = (
        df.columns.get_level_values("network").astype(int).isin(used_networks)
    )
    df = df.loc[:, used_columns]

    # Create a categorical palette to identify the networks
    network_pal = sns.husl_palette(8, s=0.45)
    network_lut = dict(zip(map(str, used_networks), network_pal))

    # Convert the palette to vectors that will be drawn on the side of the matrix
    networks = df.columns.get_level_values("network")
    network_colors = pd.Series(networks, index=df.columns).map(network_lut)

    # Draw the full plot
    g = sns.clustermap(
        df.corr(),
        center=0,
        cmap="vlag",
        row_colors=network_colors,
        col_colors=network_colors,
        dendrogram_ratio=(0.1, 0.2),
        cbar_pos=(0.02, 0.32, 0.03, 0.2),
        linewidths=0.75,
        figsize=(12, 13),
    )

    g.ax_row_dendrogram.remove()


make_clustermap()

#### Ridge Plot

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})


def make_plot():
    # Create the data
    rs = np.random.RandomState(1979)
    x = rs.randn(500)
    g = np.tile(list("ABCDEFGHIJ"), 50)
    df = pd.DataFrame(dict(x=x, g=g))
    m = df.g.map(ord)
    df["x"] += m

    # Initialize the FacetGrid object
    pal = sns.cubehelix_palette(10, rot=-0.25, light=0.7)
    g = sns.FacetGrid(df, row="g", hue="g", aspect=15, height=0.5, palette=pal)

    # Draw the densities in a few steps
    g.map(
        sns.kdeplot,
        "x",
        bw_adjust=0.5,
        clip_on=False,
        fill=True,
        alpha=1,
        linewidth=1.5,
    )
    g.map(sns.kdeplot, "x", clip_on=False, color="w", lw=2, bw_adjust=0.5)

    # passing color=None to refline() uses the hue mapping
    g.refline(y=0, linewidth=2, linestyle="-", color=None, clip_on=False)

    # Define and use a simple function to label the plot in axes coordinates
    def label(x, color, label):
        ax = plt.gca()
        ax.text(
            0,
            0.2,
            label,
            fontweight="bold",
            color=color,
            ha="left",
            va="center",
            transform=ax.transAxes,
        )

    g.map(label, "x")

    # Set the subplots to overlap
    g.figure.subplots_adjust(hspace=-0.25)

    # Remove axes details that don't play well with overlap
    g.set_titles("")
    g.set(yticks=[], ylabel="")
    g.despine(bottom=True, left=True)


make_plot()

**Hot Tip:** wrap up code to make plots into functions. Useful for:
 - parameterising your plot and facilitating code reuse
 - not polluting the global namespace
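
As a sketch of what a parameterised plotting function might look like for the pedestrian data, assuming a DataFrame with `Date_Time` and `Hourly_Counts` columns. The function name `plot_hourly_counts`, its parameters, and the `demo_df` stand-in data are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_hourly_counts(df, sensor_name, ax=None, figsize=(15, 5)):
    """Plot hourly counts for one sensor and return the Matplotlib Axes."""
    if ax is None:
        # Create a fresh figure unless the caller supplies their own axes
        _, ax = plt.subplots(figsize=figsize)
    df.plot(
        x="Date_Time",
        y="Hourly_Counts",
        legend=False,
        title=f"Hourly counts for {sensor_name}",
        ax=ax,
    )
    ax.set_xlabel("Date")
    ax.set_ylabel("Pedestrians per hour")
    return ax


# Hypothetical stand-in data with the same column names as the real dataset
demo_df = pd.DataFrame(
    {
        "Date_Time": pd.date_range("2020-03-01", periods=48, freq="h"),
        "Hourly_Counts": range(48),
    }
)

ax = plot_hourly_counts(demo_df, "Demo Sensor")
```

Returning the axes lets callers make further tweaks, and accepting an optional `ax` lets the same function draw into a subplot of a larger figure.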

### Summary: Matplotlib, Seaborn, and Pandas

Matplotlib is a powerful and expressive visualisation library, but
* can be verbose to produce more complex plots
* stateful API can be counter-intuitive 
* does not support interactivity well (but can create [animated plots](https://matplotlib.org/stable/api/animation_api.html))

Well-suited contexts of use:
* Creating high-quality bespoke visualisations needed for publication (see [How to make beautiful data visualizations in Python with matplotlib](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/))
* Use via Pandas for rapid exploratory data analysis  
* Use via Seaborn if it's a good fit for your analysis


## Dynamic visualisation with Plotly, Bokeh, and Altair

Web-first visualisation libraries built on JavaScript, all of which have interactive features out of the box.

### Plotly

We're using [Plotly Express](https://plotly.com/python/plotly-express/), Plotly's higher-level API for creating Plotly figures.

In [None]:
import plotly.express as px

px.line(
    sc_march_2020_df,
    x="Date_Time",
    y="Hourly_Counts",
    title="Hourly counts for Southern Cross Station, March 2020",
)

In [None]:
gapminder_df = px.data.gapminder()
px.scatter(
    gapminder_df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    hover_name="country",
    log_x=True,
    size_max=60,
    animation_frame="year",
    range_y=[25, 90],
    title="GDP per capita compared with life expectancy over time",
)

### HoloViews (and Bokeh)

We're using [HoloViews](https://holoviews.org) as a high-level library that targets [Bokeh](https://bokeh.org/) as well as Matplotlib and Plotly.

In [None]:
import holoviews as hv
hv.notebook_extension()
hv.extension('bokeh')

plot = hv.Curve(sc_march_2020_df, "Date_Time", "Hourly_Counts")
plot.opts(frame_width=900, frame_height=250)

### hvPlot

An even _higher_-level API built on top of HoloViews.

https://hvplot.holoviz.org

### Altair

Altair is a declarative API for producing visualisations based on the [Vega-Lite](http://vega.github.io/vega-lite/) visualization grammar.

It is based on a grammar of graphics (like [ggplot](https://ggplot2.tidyverse.org/)) and also includes a grammar of interactive graphics.

In [None]:
import altair as alt

alt.Chart(sc_march_2020_df).mark_line().encode(
    x="Date_Time:T", y="Hourly_Counts:Q"
).properties(width=1000)

Interaction is not configured by default. You need to wire this up. 

This involves a bit more [configuration](https://altair-viz.github.io/user_guide/interactions.html), 
but it means that you have more flexible interaction capabilities available to you.

In [None]:
scales = alt.selection_interval(bind="scales")

alt.Chart(sc_march_2020_df).mark_line().encode(
    x="Date_Time:T", y="Hourly_Counts:Q"
).properties(width=1000).add_selection(scales)  # in Altair >= 5, prefer .add_params(scales)

## Reactive Interfaces

See my talk on _Python Libraries for building Data Apps_: https://youtu.be/jI5zLf9Hvd8

Wrapping up visualisation code into functions/classes helps make it more reusable.

But re-running code by hand is still slow to interact with; it is not an ideal interface.

<center>
    <img src="/files/img/reactive.svg" width=1000/>
</center>

The rest of the libraries we will look at follow this fundamental pattern in some way. The main distinction is whether the code is run within the web client (in JavaScript), or whether it's run in a different process with Python.

Client-side callbacks are more responsive and let you embed the visualisation in a statically hosted HTML page, but they are limited to working with smaller amounts of data.

Python callbacks let you work with much larger datasets and use the full power of custom Python code to define callback logic. However, they require a server process to be configured and run somewhere, and they will be less responsive than client-side callbacks, especially when the server runs on a different machine to the client.
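
The reactive pattern itself can be sketched in a few lines of plain Python. This is a toy illustration rather than any library's real API: the `Slider` class, its `observe` method, and the `redraw` callback are all hypothetical names.

```python
from typing import Callable, List


class Slider:
    """A toy stand-in for a UI widget: holds a value and notifies observers."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._callbacks: List[Callable[[int], None]] = []

    def observe(self, callback: Callable[[int], None]) -> None:
        """Register a function to call whenever the value changes."""
        self._callbacks.append(callback)

    @property
    def value(self) -> int:
        return self._value

    @value.setter
    def value(self, new_value: int) -> None:
        self._value = new_value
        for callback in self._callbacks:
            callback(new_value)


rendered = []


def redraw(month: int) -> None:
    # In a real app this would re-query the data and update the figure
    rendered.append(f"plot for month {month}")


slider = Slider(value=3)
slider.observe(redraw)
slider.value = 4  # simulates the user dragging the slider
print(rendered)
```

The libraries below differ mainly in where a callback like `redraw` runs (browser or Python process) and in how much of this wiring they handle for you.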

## ipywidgets

[ipywidgets](https://ipywidgets.readthedocs.io/en/latest/) is a library of interactive widgets that can be combined with other visualisation libraries to create interactive interfaces within Jupyter Notebooks.

***
<center>
    <img src="/files/img/voila.svg" width=1000/>
</center>

Run from a terminal to launch the Voila demo notebook/dashboard:

    $ voila demos/voila.ipynb

***
<center>
    <img src="/files/img/panel.svg" width=900/>
</center>

***

<center>
    <img src="/files/img/dash.svg" width=900/>
</center>

***

<center>
    <img src="/files/img/streamlit.svg" width=1000/>
</center>

## Deploying your Apps

Altair can be deployed statically.

The rest require you to run your app somewhere.

* Your own server
* A hosted virtual machine: e.g. AWS EC2, DigitalOcean, Heroku
* Serverless cloud services:
  * AWS: Elastic Beanstalk, Fargate, Lambda
  * GCP: App Engine, Cloud Run


***
<br>
<center>
    <img src="/files/img/python_viz_libs.svg" width=1000/>
</center>

## Suggestions for choosing a library

### For general visualisation
* **Matplotlib** is good for bespoke static visualisations 
* **Seaborn** is good for building more statistically-oriented static plots
* **Pandas** (with Matplotlib) is great for rapid exploratory analysis
* **Altair** has a delightful API and is a good all-round library if your dataset is not too big; its charts can be embedded into static web pages
* **HoloViews** is a powerful abstraction API that can target all of: Matplotlib, Bokeh, and Plotly 
* **Plotly** offers a rich ecosystem: Plotly Express is a concise way to rapidly build advanced plots, and its figures extend easily to Dash apps.


### Getting the right level of abstraction
* Unless needed, avoid working primarily with the lower-level Matplotlib, Bokeh, or Plotly libraries.
* Instead, use one of the high-level libraries for better productivity.
* However, being familiar with the lower-level APIs is still important: customising plots produced by high-level libraries is an effective pattern.


### Interactive apps
* **ipywidgets** is a flexible tool that can be combined with all the libraries mentioned here to make them reactive
* **Voila** lets you easily deploy your ipywidgets-based notebooks as dashboards
* **Panel** works well if you have a heavily notebook-oriented workflow and want a balance of customisability with preconfigured styling.
* **Dash** gives you the ability to make scalable and heavily customisable web apps (best option for deployment to a larger user base), but requires more effort to visually style.
* **Streamlit** is good for rapidly building custom data tools without having to worry about layout or aesthetics

 
One library won't always be right; have more than one tool in your toolkit. Try out different libraries and see which ones feel good.

## Data Viz Hot Tips

* Try to use high-level visualisation libraries where possible  
* _Always_ title and label your plots
* Abstract code to produce distinct visualisations into functions
* If you use the functions across more than one notebook, move them into a custom Python package.