In [None]:
%load_ext autoreload
%autoreload 2

<center>
    <h1>Doing Data Viz with Python</h1>
    <br><br><br>
    <img src="/files/notebooks/img/title_viz.png" width="500"/>
    <br>
    <img src="/files/notebooks/img/python_logo.svg" width="300"/>
    <h2>Ned Letcher</h2>
    <h3>Code: <a href="https://github.com/ned2/melbviz">github.com/ned2/melbviz</a></h3>
</center>

## What is data visualisation?
* Graphic representation of data that visually encodes information
* Reveals patterns, trends, relationships
* Used to discover and communicate insights

<center>
    <h3>Examples of Visualisations</h3>
    <img src="/files/notebooks/img/plot_types.svg" width="800"/>
</center>

## Why visualise data?

Let's have a look at some pedestrian traffic for Southern Cross Station for March 2020. 

We're using the `PedestrianDataset` class from the Melbviz package to speed this up.

In [None]:
from melbviz.pedestrian import PedestrianDataset
from melbviz.config import MELBVIZ_CLEANED_DATA_PATH

data = PedestrianDataset.from_parquet(MELBVIZ_CLEANED_DATA_PATH)

The `PedestrianDataset` class has a handy `filter` method that we can use to quickly get a filtered dataset.

In [None]:
sc_march_2020 = data.filter(year=2020, month="March", sensor="Southern Cross Station")
sc_march_2020_df = sc_march_2020.df

Let's have a look at it...

In [None]:
sc_march_2020_df.head(10)

_Tabular representations of datasets are difficult to interpret_

Let's visualise the same data using a line chart:

In [None]:
sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly Counts for Southern Cross Station, March 2020",
);

_**Hot Tip:** Always title your plots!_

### _Visualisations help reveal patterns within data_

Often the most effective way to do things with data:

* describe
* explore
* summarise
* communicate

And sometimes it is more accurate than quantitative approaches...

### The Datasaurus

Like [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), the Datasaurus shows us the pitfalls of relying on summary statistics alone to understand a dataset.

https://www.autodesk.com/research/publications/same-stats-different-graphs
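
To make the point about summary statistics concrete, here is a minimal sketch using only the standard library. It compares the first two datasets of Anscombe's Quartet (values as published in the quartet; the `pearson` and `summarise` helpers are just illustrative names). The two `y` series are clearly different, yet their summaries should come out near-identical:

```python
from statistics import mean, variance

# Datasets I and II of Anscombe's quartet (they share the same x values)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = sum((a - mean_x) ** 2 for a in xs)
    spread_y = sum((b - mean_y) ** 2 for b in ys)
    return covariance / (spread_x * spread_y) ** 0.5


def summarise(xs, ys):
    """Return a few summary statistics, rounded to two decimal places."""
    return {
        "mean_y": round(mean(ys), 2),
        "variance_y": round(variance(ys), 2),
        "correlation": round(pearson(xs, ys), 2),
    }


summary_1 = summarise(x, y1)
summary_2 = summarise(x, y2)
print(summary_1)
print(summary_2)
```

Plotting `y1` and `y2` against `x` is what reveals the difference: one is roughly linear with noise, the other follows a smooth curve.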

In [None]:
! mkdir -p data
! wget -P data https://raw.githubusercontent.com/jumpingrivers/datasauRus/main/inst/extdata/DatasaurusDozen.tsv

In [None]:
from melbviz.datasaurus import make_datasaurus

make_datasaurus();

## Python Data Viz Libraries

There are a _lot_ of Python data viz libraries...

<br>
<center>
    <img src="/files/notebooks/img/python_viz_landscape.svg" width=800/>
    <h2><i>Which visualisation library to use?</i></h2>
</center>

***
<br>
<center>
    <img src="/files/notebooks/img/python_viz_libs.svg" width=1000/>
    <h2><i>A framework for comparing general purpose Python Viz Libraries</i></h2>
</center>

## Matplotlib and Pandas

Pandas' `plot` method defaults to using Matplotlib.

(Other Pandas [plotting backends](https://pandas.pydata.org/pandas-docs/dev/user_guide/visualization.html#plotting-backends) currently available are [Plotly](https://plotly.com/python/pandas-backend/) and [Bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh))

In [None]:
import matplotlib as mpl
%matplotlib inline

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
);

**Limitations**
1. Doesn't look very pretty out of the box
2. Pandas plotting API is limited
3. Static image: can’t zoom or toggle visibility of data

### Improving the Aesthetics of Static Plots

Start by using an alternative style

In [None]:
import matplotlib.pyplot as plt

plt.style.available

In [None]:
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before Matplotlib 3.6

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    legend=None,
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
);

In [None]:
fig = plt.figure(dpi=300, figsize=(15, 5))

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    legend=None,
    linewidth=1.2,
    title="Hourly counts for Southern Cross Station, March 2020",
    ax=plt.gca(),  # supply Pandas with the axes from the current figure
);

### Use `ipympl` for interactivity

In [None]:
%matplotlib widget

sc_march_2020_df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    figsize=(15, 5),
    title="Hourly counts for Southern Cross Station, March 2020",
);

### For more statistical use-cases, see if Seaborn has what you need

[Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)

In [None]:
%matplotlib inline
import seaborn as sns


def make_flights_relplot():
    sns.set_theme(style="dark")
    flights = sns.load_dataset("flights")

    # Plot each year's time series in its own facet
    relplot = sns.relplot(
        data=flights,
        x="month",
        y="passengers",
        col="year",
        hue="year",
        kind="line",
        palette="crest",
        linewidth=4,
        zorder=5,
        col_wrap=3,
        height=2,
        aspect=1.5,
        legend=False,
    )

    for year, ax in relplot.axes_dict.items():
        # Add the title as an annotation within the plot
        ax.text(0.8, 0.85, year, transform=ax.transAxes, fontweight="bold")
        # Plot every year's time series in the background
        sns.lineplot(
            data=flights,
            x="month",
            y="passengers",
            units="year",
            estimator=None,
            color=".7",
            linewidth=1,
            ax=ax,
        )
    # Reduce the frequency of the x axis ticks
    ax.set_xticks(ax.get_xticks()[::2])
    # other tweaks
    relplot.set_titles("")
    relplot.set_axis_labels("", "Passengers")
    relplot.tight_layout()


make_flights_relplot()

### Summary: Matplotlib, Seaborn, and Pandas

Matplotlib is a powerful and expressive visualisation library, but
* can be verbose to produce more complex plots
* stateful API can be counter-intuitive 
* does not support interactivity well (but can create [animated plots](https://matplotlib.org/stable/api/animation_api.html))

Well-suited contexts of use:
* Creating high-quality bespoke visualisations needed for publication (see [How to make beautiful data visualizations in Python with matplotlib](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/))
* Use via Pandas for rapid exploratory data analysis  
* Use via Seaborn if it's a good fit for your analysis
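
The "stateful API" point above may be easier to see side by side. The following is a minimal sketch with made-up counts (not the pedestrian dataset) contrasting pyplot's implicit current-figure style with Matplotlib's explicit object-oriented style:

```python
import matplotlib.pyplot as plt

# Made-up hourly counts for a single day, purely for illustration
hours = list(range(24))
counts = [30, 20, 15, 10, 25, 120, 600, 1900, 3200, 1500, 800, 900,
          1100, 950, 800, 1000, 2100, 3400, 2000, 900, 500, 300, 150, 60]

# Stateful style: each pyplot call acts on an implicit "current" figure and axes
plt.figure(figsize=(15, 5))
plt.plot(hours, counts)
plt.title("Hourly counts (stateful API)")

# Object-oriented style: keep explicit handles and call methods on them
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(hours, counts, linewidth=1.2)
ax.set_title("Hourly counts (object-oriented API)")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Pedestrian count")
```

With explicit `fig` and `ax` handles it is always clear which plot a call modifies, which tends to matter once a notebook contains more than one figure.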


## Dynamic visualisation with Plotly, Bokeh, and Altair

Web-first visualisation libraries based on JavaScript that all have interactive features out of the box.

### Plotly

We're using [Plotly Express](https://plotly.com/python/plotly-express/), Plotly's higher-level API for creating Plotly figures.

In [None]:
import plotly.express as px

figure = px.line(
    sc_march_2020_df,
    x="Date_Time",
    y="Hourly_Counts",
    title="Hourly counts for Southern Cross Station, March 2020",
)

figure

### Bokeh

We're using [Holoviews](https://holoviews.org) as a high-level library on top of [Bokeh](https://bokeh.org/).

In [None]:
import holoviews as hv
hv.notebook_extension()
hv.extension('bokeh')

plot = hv.Curve(sc_march_2020_df, "Date_Time", "Hourly_Counts")
plot.opts(frame_width=900, frame_height=250)

### Altair

Altair is a declarative API for producing visualisations based on the [Vega-Lite](http://vega.github.io/vega-lite/) visualization grammar.

It is based on a grammar of graphics (like [ggplot](https://ggplot2.tidyverse.org/)) and also includes a grammar of interactive graphics.

In [None]:
import altair as alt

alt.Chart(sc_march_2020_df).mark_line().encode(
    x="Date_Time:T", y="Hourly_Counts:Q"
).properties(width=1000)

Interaction is not configured by default. You need to wire this up. 

This involves a bit more [configuration](https://altair-viz.github.io/user_guide/interactions.html), 
but it means that you have more flexible interaction capabilities available to you.

In [None]:
scales = alt.selection_interval(bind="scales")

alt.Chart(sc_march_2020_df).mark_line().encode(
    x="Date_Time:T", y="Hourly_Counts:Q"
).properties(width=1000).add_selection(scales)

## Reactive Interfaces

Wrapping up visualisation code into functions/classes helps make it more reusable.

But this is still slow to interact with; it is not an ideal interface.

In [None]:
data.filter(year=2017, month="March", sensor="Southbank").plot(
    "sensor_traffic"
)

<center>
    <img src="/files/notebooks/img/reactive.svg" width=1000/>
</center>

The rest of the libraries we will look at follow this fundamental pattern in some way. The main distinction is whether the code is run within the web client (in JavaScript), or whether it's run in a different process with Python.

Client-side callbacks are more responsive and allow the visualisation to be embedded within a statically hosted HTML page, but are limited to working with smaller amounts of data.

Python callbacks let you work with much larger datasets and use the full power of custom Python code to define callback logic. However, they require a server process to be configured and run somewhere, and they will be less responsive than client-side callbacks, especially when the server process is running on a different machine to the client.
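
The reactive pattern itself is small enough to sketch in plain Python. The toy `Control` class below is hypothetical and is not the API of any of the libraries covered here; it only illustrates the idea of a control notifying registered callbacks whenever its value changes:

```python
class Control:
    """A toy input control that notifies observers when its value changes."""

    def __init__(self, value):
        self._value = value
        self._callbacks = []

    def observe(self, callback):
        """Register a function to call with the new value on every change."""
        self._callbacks.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value
        for callback in self._callbacks:
            callback(new_value)


rendered = []  # stands in for the plot output area


def redraw(month):
    """Callback: recompute the 'plot' for the newly selected month."""
    rendered.append(f"Plot of sensor traffic for {month}")


month_control = Control("March")
month_control.observe(redraw)

# Simulate the user picking new values in the control
month_control.value = "April"
month_control.value = "May"
print(rendered)
```

In the real libraries the control is rendered in the browser, and the callback runs either in JavaScript (client-side) or in a Python process (server-side).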

## Altair
_client-side_

In [None]:
def altair_interactive_sensor_traffic(dataset, year):
    filtered_dataset = dataset.filter(year=year)

    sensor_input = alt.binding_select(options=filtered_dataset.sensors)
    sensor_selection = alt.selection_single(
        fields=["Sensor_Name"],
        bind=sensor_input,
        init={"Sensor_Name": filtered_dataset.sensors[0]},
    )
    
    month_input = alt.binding_select(options=filtered_dataset.months)
    month_selection = alt.selection_single(
        fields=["Month"],
        bind=month_input,
        init={"Month": filtered_dataset.months[0]},
    )

    return (
        alt.Chart(filtered_dataset.df)
        .mark_line()
        .encode(x="Date_Time:T", y="Hourly_Counts:Q")
        .add_selection(month_selection)
        .add_selection(sensor_selection)
        .transform_filter(month_selection)
        .transform_filter(sensor_selection)
        .properties(width=1000)
    )

**Hot Tip:** wrap up code to make plots into functions. Useful for:
 - parameterising your plot and facilitating code reuse
 - not polluting the global namespace

The challenge is that _all_ the data required for the plot must be loaded into the client as part of a JSON-defined Vega-Lite specification, which limits the amount of data you can work with.
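
To get a feel for why this matters, here is a rough sketch that builds a Vega-Lite-style specification by hand with inline data and measures how large the serialised JSON becomes. The rows are made up and `make_spec` is a hypothetical helper, not part of Altair:

```python
import json


def make_spec(n_rows):
    """Build a Vega-Lite-style line chart spec with n_rows of inline data."""
    rows = [
        {"Date_Time": f"2020-03-01T{i % 24:02d}:00:00", "Hourly_Counts": i}
        for i in range(n_rows)
    ]
    return {
        "data": {"values": rows},
        "mark": "line",
        "encoding": {
            "x": {"field": "Date_Time", "type": "temporal"},
            "y": {"field": "Hourly_Counts", "type": "quantitative"},
        },
    }


# The serialised spec grows roughly in proportion to the number of rows
sizes = {n: len(json.dumps(make_spec(n))) for n in (100, 1_000, 10_000)}
for n_rows, n_chars in sizes.items():
    print(f"{n_rows:>6} rows -> {n_chars:>9,} characters of JSON")
```

Every one of those characters has to be shipped to, and parsed by, the browser before the chart can render.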

Let's work around this by first filtering down to a single year in Python, and even then we still have to disable the max rows limit in Altair.

The plot takes a little while to load...

In [None]:
alt.data_transformers.disable_max_rows()

altair_interactive_sensor_traffic(data, year=2020)

## Plotly Express
_client-side_

The animation feature of Plotly Express gets us some useful interactivity for exploring some types of data, but otherwise, we'll need to turn to Dash.

In [None]:
gapminder_df = px.data.gapminder()
px.scatter(
    gapminder_df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    hover_name="country",
    log_x=True,
    size_max=60,
    animation_frame="year",
    range_y=[25, 90],
    title="GDP per capita compared with life expectancy over time",
)

## ipywidgets
_server-side_

[ipywidgets](https://ipywidgets.readthedocs.io/en/latest/) is a library of interactive widgets that can be combined with other visualisation libraries to create interactive interfaces within Jupyter Notebooks.

In the code below, we combine with Plotly to allow quick filtering of the sensor traffic plot to specific months, years, and sensors.

In [None]:
from ipywidgets import interact, Dropdown, HBox, VBox

# TODO: arrange the controls more nicely with HBox and VBox

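def interactive_sensor_traffic(dataset):
    """Make an interactive sensor traffic plot with dropdown filters"""
    year_widget = Dropdown(options=dataset.years)
    month_widget = Dropdown(options=dataset.months)
    # use the dataset argument here rather than the global `data`
    sensor_widget = Dropdown(options=dataset.sensors)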

    def update_widgets(*args):
        """Update month and sensor values to be only those available for selected year"""
        filtered_data = dataset.filter(year=year_widget.value)
        month_widget.options = filtered_data.months
        sensor_widget.options = filtered_data.sensors

    # register update_widgets as callback to be run on year change
    year_widget.observe(update_widgets, names="value")

    @interact(year=year_widget, month=month_widget, sensor=sensor_widget)
    def plot(year, month, sensor):
        """Plot the sensor traffic for selected year, month, and sensor"""
        filtered_data = dataset.filter(year=year, month=month, sensor=sensor)
        if len(filtered_data.df) == 0:
            return f"No records for {year}, {month}, {sensor}"
        return filtered_data.plot("sensor_traffic")


In [None]:
interactive_sensor_traffic(data)

***
<center>
    <img src="/files/notebooks/img/voila.svg" width=1000/>
</center>

Run from a terminal to launch the Voila demo notebook/dashboard:

    $ voila demos/voila.ipynb

***
<center>
    <img src="/files/notebooks/img/panel.svg" width=900/>
</center>

In [None]:
import panel as pn

hv.extension("plotly")
pn.extension("plotly")


def panel_interactive_sensor_traffic(dataset):
    
    def plot_traffic(year, month, sensor):
        """Plot the sensor traffic for selected year, month, and sensor"""
        filtered_data = dataset.filter(year=year, month=month, sensor=sensor)
        if len(filtered_data.df) == 0:
            return f"No records for {year}, {month}, {sensor}"
        fig = filtered_data.get_fig("sensor_traffic")
        return pn.pane.Plotly(fig)

    year_select = pn.widgets.Select(name="Year", options=dataset.years)
    month_select = pn.widgets.Select(name="Month", options=dataset.months)
    sensor_select = pn.widgets.Select(name="Sensor", options=dataset.sensors)
    reactive_plot = pn.bind(plot_traffic, year_select, month_select, sensor_select)

    controls = pn.Column(
        "<br>\n# Sensor Traffic", year_select, month_select, sensor_select
    )
    return pn.Row(controls, reactive_plot)


panel_interactive_sensor_traffic(data)

***

<center>
    <img src="/files/notebooks/img/dash.svg" width=900/>
</center>

Dash apps can be run from JupyterLab using JupyterDash:

In [None]:
from dash import Input, Output, State, dcc, html
from jupyter_dash import JupyterDash

from melbviz.utils import make_options


app = JupyterDash(__name__)

controls = html.Div(
    id="controls",
    style={"width":500},
    children=[
        html.Div(
            [
                html.Label("Year"),
                dcc.Dropdown(
                    id="year-input",
                    className="input",
                    options=make_options(data.years),
                    value=data.years[-1],
                ),
            ]
        ),
        html.Div(
            [html.Label("Month"), dcc.Dropdown(id="month-input", className="input")]
        ),
        html.Div(
            [
                html.Label("Sensor"),
                dcc.Dropdown(id="sensor-input", multi=True, className="input"),
            ]
        ),
    ],
)

app.layout = html.Div([controls, dcc.Graph(id="sensor-traffic")])


@app.callback(
    Output("month-input", "options"),
    Output("month-input", "value"),
    Output("sensor-input", "options"),
    Output("sensor-input", "value"),
    Input("year-input", "value"),
)
def update_inputs(year):
    new_data = data.filter(year=year)
    return (make_options(new_data.months), None, make_options(new_data.sensors), None)


@app.callback(
    Output("sensor-traffic", "figure"),
    Input("year-input", "value"),
    Input("month-input", "value"),
    Input("sensor-input", "value"),
)
def sensor_traffic(year, month, sensor):
    figure = data.filter(year=year, month=month, sensor=sensor).get_fig("sensor_traffic")
    return figure

Then run your app with one of these modes:
* `inline`: display in the cell's output area in the notebook
* `external`: open a new tab to display the app
* `jupyterlab`: display in a separate JupyterLab tab

In [None]:
app.run_server(mode='inline')
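
The `make_options` helper imported above comes from Melbviz and is not shown in this notebook. Dash dropdowns accept their options as a list of dictionaries with `label` and `value` keys, so a helper along these lines would do the job. This is a hypothetical sketch under that assumption, not the actual Melbviz implementation:

```python
def make_options_sketch(values):
    """Hypothetical helper: turn plain values into Dash dropdown options."""
    return [{"label": str(value), "value": value} for value in values]


year_options = make_options_sketch([2019, 2020])
print(year_options)
```

Keeping the original value (rather than its string label) in `value` means callbacks receive the same type that the dataset uses for filtering.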

Dash apps can also be run as a standalone web app, e.g. **runapp.py**:


```python
#!/usr/bin/env python
from melbviz.app import app


if __name__ == "__main__":
    app.run_server(debug=True, port="8887")

```

***

<center>
    <img src="/files/notebooks/img/streamlit.svg" width=1000/>
</center>

## Deploying your Apps

Altair can be deployed statically.

The rest require you to run your app somewhere.

* Your own server
* A hosted virtual machine: e.g. AWS EC2, Digital Ocean, Heroku
* Serverless cloud services:
  * AWS: Elastic Beanstalk, Fargate, Lambda
  * GCP: App Engine, Cloud Run
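
Static deployment, as mentioned for Altair above, means the whole visualisation lives in a single HTML file that any web host can serve. Altair charts have a `save` method that can write such a file for you. The sketch below hand-rolls roughly what a file like that contains, using a tiny made-up spec and the Vega, Vega-Lite, and Vega-Embed scripts loaded from a CDN (the versions pinned here are assumptions):

```python
import json
from pathlib import Path

# A tiny Vega-Lite spec with made-up inline data
values = [{"hour": hour, "count": count} for hour, count in enumerate([5, 40, 90, 30])]
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": values},
    "mark": "line",
    "encoding": {
        "x": {"field": "hour", "type": "quantitative"},
        "y": {"field": "count", "type": "quantitative"},
    },
}

PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>vegaEmbed("#vis", SPEC_JSON);</script>
</body>
</html>
"""

# Inline the spec as a JavaScript object literal and write the page to disk
html = PAGE_TEMPLATE.replace("SPEC_JSON", json.dumps(spec))
output_path = Path("chart.html")
output_path.write_text(html, encoding="utf-8")
```

Because the data is inlined in the page, no Python process is needed once the file has been written.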


***
<br>
<center>
    <img src="/files/notebooks/img/python_viz_libs.svg" width=1000/>
</center>

## Suggestions for choosing a library

### For general visualisation
* **Matplotlib** is good for bespoke static visualisations 
* **Seaborn** is good for building more statistically-oriented static plots
* **Pandas** (with Matplotlib) is great for rapid exploratory analysis
* **Altair** has a delightful API and is a good all-round library if your dataset is not too big; its charts can be embedded into static web pages
* **HoloViews** is a powerful abstraction API that can target all of: Matplotlib, Bokeh, and Plotly 
* **Plotly** offers a rich ecosystem: Plotly Express is a concise way to rapidly build advanced plots, and its figures extend easily to Dash apps.


### Getting the right level of abstraction
* Unless needed, avoid working primarily with the lower-level Matplotlib, Bokeh, or Plotly libraries.
* Instead, use one of the high-level libraries for better productivity.
* However, being familiar with the lower-level API is important: customising plots produced by high-level libraries is an effective pattern.
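
As a minimal sketch of that last pattern, the example below lets Pandas build the plot and then customises the Matplotlib `Axes` it returns. The data is made up for illustration:

```python
import pandas as pd

# Made-up hourly counts for one day, purely for illustration
df = pd.DataFrame(
    {
        "Date_Time": pd.to_datetime([f"2020-03-02 {h:02d}:00" for h in range(24)]),
        "Hourly_Counts": [30, 20, 15, 10, 25, 120, 600, 1900, 3200, 1500, 800, 900,
                          1100, 950, 800, 1000, 2100, 3400, 2000, 900, 500, 300, 150, 60],
    }
)

# High-level call: Pandas creates the figure and returns the Matplotlib Axes
ax = df.plot(
    x="Date_Time",
    y="Hourly_Counts",
    legend=False,
    figsize=(15, 5),
    title="Hourly counts (made-up data)",
)

# Low-level tweaks: call Matplotlib methods directly on the returned Axes
ax.set_ylabel("Pedestrians per hour")
ax.axhline(df["Hourly_Counts"].mean(), linestyle="--", color="grey")
```

The same approach applies to the other high-level libraries: Plotly Express, for instance, returns a regular Plotly figure that can be adjusted further.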


### Interactive apps
* **ipywidgets** is a flexible tool that can be combined with all the libraries mentioned here to make them reactive
* **Voila** lets you easily deploy your ipywidgets-based notebooks as dashboards
* **Panel** works well if you have a heavily notebook-oriented workflow and want a balance of customisability with preconfigured styling.
* **Dash** gives you the ability to make scalable and heavily customisable web apps (best option for deployment to a larger user base), but requires more effort to visually style.
* **Streamlit** is good for rapidly building custom data tools without having to worry about layout or aesthetics

 
One library won't always be right; have more than one tool in your toolkit. Try out different libraries and see which ones feel good.

## Data Viz Hot Tips

* Try to use high-level visualisation libraries where possible  
* _Always_ title and label your plots
* Abstract code to produce distinct visualisations into functions
* If you use the functions across more than one notebook, move them into a custom Python package.