This notebook is a RISE presentation. 

In addition to the `RISE` Python package, you will need the `hide_code` Python package installed.

To run, load [this notebook in the classic Jupyter Notebook](/notebooks/notebooks/talk_data_viz.ipynb) and click the _Enter/Exit RISE Slideshow_ button.

In [3]:
%load_ext autoreload
%autoreload 2

# import modules and load data required for this presentation

import os
from pathlib import Path

from ipywidgets import interact, fixed, HBox, Output, Dropdown
from IPython.display import display
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go

from melbviz.pedestrian import PedestrianDataset
from melbviz.config import MELBVIZ_DATA_PATH, MELBVIZ_COUNTS_CSV_PATH, MELBVIZ_SENSOR_CSV_PATH



<center>
    <h1>Interactive Data Visualisation with Python</h1>
    <br><br><br>
    <img src="/files/notebooks/img/title_viz.png" width="500"/>
    <br>
    <img src="/files/notebooks/img/python_logo.svg" width="300"/>
</center>

In [5]:
%%html 

<style>
.people-table {
    border-spacing: 3em !important;
    border-collapse: separate !important;
    color: white !important;
}

.people-table tr {
    background: white !important;
}

.people-table img {
    box-shadow: 1px 1px 5px 1px black !important;
}
</style>

## Who are we?

<table class="people-table">
    <tr>
        <td><img src="/files/notebooks/img/ned.jpg"/></td>
    </tr>
    <tr>
        <td><center><h4>Ned Letcher</h4></center></td>
    </tr>
</table>

<center>
    <img src="/files/notebooks/img/thoughtworks.png"/>
</center>

## Where are we?

In [11]:
# "open-street-map" tiles do not need a Mapbox access token
px.scatter_mapbox(
    lat=[-37.8136], lon=[144.9631], hover_name=["Melbourne"], zoom=1.3, mapbox_style="open-street-map"
)

## What is data visualisation?

* Graphic representation of data that visually encodes information

* Reveals patterns, trends, relationships

* Used to discover and communicate insights

<center>
    <h3>Examples of Visualisations</h3>
    <img src="/files/notebooks/img/plot_types.svg" width="800"/>
</center>

In [7]:
# prep data for the next section

data = PedestrianDataset.load(MELBVIZ_COUNTS_CSV_PATH, sensor_csv_path=MELBVIZ_SENSOR_CSV_PATH)
pedestrian_data = data.filter(year=2019, month="March", sensor="Southern Cross Station")
pedestrian_df = pedestrian_data.df
data.years = list(reversed(data.years))

def filter_df(year=None, month=None, sensor=None):
    return data.filter(year=year, month=month, sensor=sensor).df

def plot_sensor_traffic(df):
    fig_func = data.get_plot_func("sensor_traffic")
    fig = fig_func(df, width=1500, height=500)
    fig.update_layout(font_size=18, margin_r=170)
    return fig
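The `filter` method used above belongs to `melbviz`, and its implementation is not shown in this notebook. As a rough sketch of what filtering on optional criteria can look like in plain pandas, the hypothetical `filter_counts` helper below combines boolean masks. The column names follow the pedestrian table shown in this notebook; the sample rows and the "Example Sensor" name are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the real dataset.
sample_df = pd.DataFrame({
    "Year": [2018, 2019, 2019, 2019],
    "Month": ["March", "March", "April", "March"],
    "Sensor_Name": [
        "Southern Cross Station",
        "Southern Cross Station",
        "Southern Cross Station",
        "Example Sensor",
    ],
    "Hourly_Counts": [10, 20, 30, 40],
})


def filter_counts(df, year=None, month=None, sensor=None):
    """Hypothetical stand-in for PedestrianDataset.filter; None means 'no constraint'."""
    # Start with every row selected, then narrow the mask per supplied criterion
    mask = pd.Series(True, index=df.index)
    if year is not None:
        mask &= df["Year"] == year
    if month is not None:
        mask &= df["Month"] == month
    if sensor is not None:
        mask &= df["Sensor_Name"] == sensor
    return df[mask]


march_2019 = filter_counts(sample_df, year=2019, month="March")
print(march_2019)
```

Treating `None` as "no constraint" lets one function serve both the fully filtered view and the partially filtered views that populate dropdown options.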

## Why visualise data?

_Tabular representations of datasets are difficult to interpret_

In [5]:
pedestrian_df.head(15)

| Unnamed: 0 | Date_Time | Year | Month | Mdate | Day | Time | Sensor_ID | Sensor_Name | Hourly_Counts | datetime_flat_year | sensor_id | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1630174 | 2019-03-01 00:00:00 | 2019 | March | 1 | Friday | 0 | 9 | Southern Cross Station | 28 | 2000-03-01 00:00:00 | 9 | -37.81983 | 144.951026 |
| 1630175 | 2019-03-01 01:00:00 | 2019 | March | 1 | Friday | 1 | 9 | Southern Cross Station | 19 | 2000-03-01 01:00:00 | 9 | -37.81983 | 144.951026 |
| 1630176 | 2019-03-01 02:00:00 | 2019 | March | 1 | Friday | 2 | 9 | Southern Cross Station | 5 | 2000-03-01 02:00:00 | 9 | -37.81983 | 144.951026 |
| 1630177 | 2019-03-01 03:00:00 | 2019 | March | 1 | Friday | 3 | 9 | Southern Cross Station | 8 | 2000-03-01 03:00:00 | 9 | -37.81983 | 144.951026 |
| 1630178 | 2019-03-01 04:00:00 | 2019 | March | 1 | Friday | 4 | 9 | Southern Cross Station | 11 | 2000-03-01 04:00:00 | 9 | -37.81983 | 144.951026 |
| 1630179 | 2019-03-01 05:00:00 | 2019 | March | 1 | Friday | 5 | 9 | Southern Cross Station | 134 | 2000-03-01 05:00:00 | 9 | -37.81983 | 144.951026 |
| 1630180 | 2019-03-01 06:00:00 | 2019 | March | 1 | Friday | 6 | 9 | Southern Cross Station | 578 | 2000-03-01 06:00:00 | 9 | -37.81983 | 144.951026 |
| 1630181 | 2019-03-01 07:00:00 | 2019 | March | 1 | Friday | 7 | 9 | Southern Cross Station | 2065 | 2000-03-01 07:00:00 | 9 | -37.81983 | 144.951026 |
| 1630182 | 2019-03-01 08:00:00 | 2019 | March | 1 | Friday | 8 | 9 | Southern Cross Station | 4406 | 2000-03-01 08:00:00 | 9 | -37.81983 | 144.951026 |
| 1630183 | 2019-03-01 09:00:00 | 2019 | March | 1 | Friday | 9 | 9 | Southern Cross Station | 2801 | 2000-03-01 09:00:00 | 9 | -37.81983 | 144.951026 |


## Why visualise data?

_Visual representations help patterns jump out_

In [6]:
plot_sensor_traffic(pedestrian_df)
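Even a crude visual encoding makes the point. The sketch below is not part of `melbviz`; it takes the ten hourly counts from the table above and draws them as text bars, using a hypothetical `text_bars` helper and only the standard library.

```python
# Hourly counts copied from the table above
# (Southern Cross Station, Friday 1 March 2019, hours 0 to 9).
hourly_counts = {0: 28, 1: 19, 2: 5, 3: 8, 4: 11, 5: 134, 6: 578, 7: 2065, 8: 4406, 9: 2801}


def text_bars(counts, width=40):
    """Return one line per hour, with a bar scaled so the busiest hour fills `width`."""
    peak = max(counts.values())
    lines = []
    for hour, count in counts.items():
        bar = "#" * round(width * count / peak)
        lines.append(f"{hour:02d}:00 {count:>5} {bar}")
    return lines


bars = text_bars(hourly_counts)
print("\n".join(bars))
```

In the printed bars the morning ramp-up towards 08:00 is visible at a glance, whereas in the table the same information has to be read number by number.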

## Why visualise data?

_Summary statistics can mislead_

* mean
* median
* standard deviation
* correlations between variables
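A toy illustration of the problem, assuming nothing beyond pandas: the two hand-made datasets below (hypothetically named `cup` and `cap`, and unrelated to the Datasaurus data) have the same means, standard deviations, and correlation, yet one is a U shape and the other is its upside-down mirror image.

```python
import pandas as pd

# Two tiny hand-made datasets: "cup" is a U shape, "cap" is the same shape flipped.
points = pd.DataFrame({
    "dataset": ["cup"] * 5 + ["cap"] * 5,
    "x": [-2, -1, 0, 1, 2] * 2,
    "y": [4, 1, 0, 1, 4] + [0, 3, 4, 3, 0],
})


def summarise(group):
    """Compute the summary statistics listed above for one dataset."""
    return pd.Series({
        "x_mean": group["x"].mean(),
        "y_mean": group["y"].mean(),
        "x_std": group["x"].std(),
        "y_std": group["y"].std(),
        "corr": group["x"].corr(group["y"]),
    })


# One row of statistics per dataset
stats = pd.DataFrame(
    {name: summarise(group) for name, group in points.groupby("dataset")}
).T
print(stats)
```

The two rows of `stats` agree, so these summary statistics alone cannot tell the shapes apart; a scatter plot of each dataset can.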

In [9]:
def object_as_widget(obj):
    out = Output()
    with out:
        display(obj)
    return out


def show_datasaurus(datasaurus_df, column):
    # Select either every dataset (faceted plot) or the single one chosen in the dropdown
    if column == "all":
        df = datasaurus_df
        fig = px.scatter(df, facet_col_wrap=5, facet_col="dataset", x="x", y="y")
        # Shorten facet titles from "dataset=dino" to "dino"
        fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
    else:
        df = datasaurus_df[datasaurus_df["dataset"] == column]
        fig = px.scatter(df, x="x", y="y")
    # Summary statistics for the selected points
    stats_df = pd.DataFrame({
        "statistic": ["x_mean", "y_mean", "x_std", "y_std", "corr"],
        "value": [
            df["x"].mean(), df["y"].mean(), df["x"].std(), df["y"].std(), df["x"].corr(df["y"])
        ],
    })
    fig.update_layout(
        margin={"r": 10, "t": 40, "b": 10},
        font_size=18,
        width=600,
        height=500,
    )
    # Show the statistics table and the figure side by side
    return HBox([object_as_widget(stats_df), go.FigureWidget(fig)])


def make_datasaurus(path="data/DatasaurusDozen.tsv"):
    datasaurus_df = pd.read_csv(path, delimiter="\t")
    # Offer each dataset name plus an "all" option in the dropdown
    columns = list(datasaurus_df["dataset"].unique())
    columns.append("all")
    widget = interact(show_datasaurus, datasaurus_df=fixed(datasaurus_df), column=columns)
    return widget

## The Datasaurus

https://www.autodesk.com/research/publications/same-stats-different-graphs

In [10]:
make_datasaurus();


## Why visualise data?

_The human visual system is a powerful tool_

<center>
    <img src="/files/notebooks/img/horizon_plot.jpg" style="height:80vh"/>
    <small><a href="https://twitter.com/xangregg/status/883763762381152256">@xangregg</a></small>
</center>

## Why interactive visualisations?

* more ergonomic data analysis

* faster time to insights

* self-service insights (e.g. dashboards)

* agile prototypes

## Why _interactive_ visualisations?

In [14]:
plot_sensor_traffic(pedestrian_df)

In [10]:
from melbviz.utils import sort_months

year_widget = Dropdown(options=data.years)
month_widget = Dropdown(options=data.months)
sensor_widget = Dropdown(options=data.sensors)


def update_widgets(*args):
    df = filter_df(year=year_widget.value)
    month_widget.options = sort_months(df["Month"].unique())
    sensor_widget.options = sorted(df["Sensor_Name"].unique())

# Only react when the selected year changes, not on every trait update
year_widget.observe(update_widgets, names="value")
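The dependent-dropdown logic above can be separated from the widgets and checked on its own. The sketch below is an assumption-laden stand-in: `options_for_year` is a hypothetical helper, the month ordering is a guess at what `sort_months` from `melbviz.utils` does, and the sample rows and sensor names are made up. Only the `Year`, `Month`, and `Sensor_Name` column names come from the dataset shown earlier.

```python
import pandas as pd

# Calendar order used to sort month names (assumed behaviour of sort_months)
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_ORDER = {name: position for position, name in enumerate(MONTHS)}

# Made-up rows for illustration only
records = pd.DataFrame({
    "Year": [2018, 2019, 2019, 2019],
    "Month": ["June", "April", "March", "March"],
    "Sensor_Name": ["Sensor C", "Sensor B", "Sensor A", "Sensor B"],
})


def options_for_year(df, year):
    """Return (months, sensors) that actually have records in the given year."""
    subset = df[df["Year"] == year]
    months = sorted(subset["Month"].unique(), key=MONTH_ORDER.__getitem__)
    sensors = sorted(subset["Sensor_Name"].unique())
    return months, sensors


months, sensors = options_for_year(records, 2019)
print(months, sensors)
```

The two returned lists are the kind of values that `update_widgets` assigns to `month_widget.options` and `sensor_widget.options`, so the dropdowns only offer combinations that exist for the chosen year.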


In [11]:
@interact(year=year_widget, month=month_widget, sensor=sensor_widget)
def plot(year, month, sensor):
    pedestrian_df = filter_df(year=year, month=month, sensor=sensor)
    if len(pedestrian_df) == 0:
        return f"No records for {year}, {month}, {sensor}"
    figure = plot_sensor_traffic(pedestrian_df)
    return figure


## Python Data Viz Libraries

<center>
    <img src="/files/notebooks/img/python_viz_landscape.svg" style="height:90vh"/>
</center>

<center>
    <img src="/files/notebooks/img/python_viz_libs.svg" style="height:90vh"/>
</center>