In [None]:
# notebook setup

# automatically reload modules when they change
%load_ext autoreload
%autoreload 2

# Hide scary looking warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# display value of cell-final assignment statement in addition to expressions 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr_or_assign"

# Interactive Data Visualisation with Python

## Jupyter Notebooks

Interactive development environment (IDE) that's great for data analysis.

A Jupyter Notebook is a sequence of _cells_ that are of two main types:
1. **Text cells** (using Markdown)
2. **Code cells** (Python for us)

Cells are run, and produce visual outputs:
* Standard output produced by Python 
* Tabular data
* Visualisations


## The Dataset

The Melbourne City Council's Pedestrian Counting System datasets.

_This dataset contains hourly pedestrian counts since 2009 from pedestrian sensor devices located across the city. The data is updated on a monthly basis and can be used to determine variations in pedestrian activity throughout the day._

Two separate datasets:

1. [The Pedestrian Counting System dataset](https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-2009-to-Present-counts-/b2ak-trbp), which contains the hourly traffic data.
2. [Pedestrian Sensor Locations](https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-Sensor-Locations/h57g-5234) dataset, which contains data about the sensors collecting the above data. 

In [None]:
import os
from pathlib import Path

data_path = Path(os.getenv("DATA_PATH", "../data"))
sensor_csv_path = data_path / "Pedestrian_Counting_System_-_Sensor_Locations.csv"
counts_csv_path = data_path / "Pedestrian_Counting_System___2009_to_Present__counts_per_hour_.csv"

## Outcomes

What questions are we trying to answer?

1. _What does monthly traffic look like over the years?_
2. _What does daily traffic look like?_
3. _What are the most trafficked parts of Melbourne’s CBD?_

### A Crash Course in Pandas

Pandas is a Python tool for general purpose data manipulation and analysis.

Kind of like Excel + SQL for Python... but so much more!

Two main data types:
* `DataFrame`
* `Series`


Let's load the dataset of hourly pedestrian counts:

In [None]:
import pandas as pd

counts_df = pd.read_csv(counts_csv_path, index_col="ID", parse_dates=["Date_Time"]);

In [None]:
type(counts_df)

In [None]:
counts_df.head()

In [None]:
counts_df.dtypes
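
To see what `index_col` and `parse_dates` do without the full file, here is a small sketch that reads a made-up three-row CSV from memory. The rows are invented for illustration; only the column names `ID`, `Date_Time`, `Sensor_Name` and `Hourly_Counts` come from the dataset above.

```python
import io

import pandas as pd

# Made-up rows that mimic the shape of the counts CSV (not real data).
toy_csv = io.StringIO(
    "ID,Date_Time,Sensor_Name,Hourly_Counts\n"
    "1,2019-11-01 17:00:00,Southbank,1200\n"
    "2,2019-11-01 18:00:00,Southbank,950\n"
    "3,2019-11-02 09:00:00,Southbank,400\n"
)

toy_df = pd.read_csv(toy_csv, index_col="ID", parse_dates=["Date_Time"])

# ID is now the row index, and Date_Time is a datetime column rather than text,
# so date parts are available through the .dt accessor.
toy_hours = toy_df["Date_Time"].dt.hour.tolist()
```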

#### Working with DataFrame and Series Objects

Think of a `DataFrame` as a 2-dimensional data structure: **rows** x **columns**

In [None]:
counts_df.shape

Each column of a `DataFrame` is a `Series`.

A `Series` is a 1-dimensional data structure.

Like a Python list, but:
* has a type
* has an index
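
A quick sketch of those two differences, using made-up counts (the values and day labels are invented for illustration):

```python
import pandas as pd

# A Series built by hand: values, an index of labels, and a name.
toy_counts = pd.Series(
    [120, 340, 95], index=["Mon", "Tue", "Wed"], name="Hourly_Counts"
)

# One dtype is shared by every element, unlike a plain Python list.
is_integer = pd.api.types.is_integer_dtype(toy_counts.dtype)

# Elements are looked up by index label, not only by position.
tuesday_count = toy_counts["Tue"]
```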

In [None]:
counts_df["Hourly_Counts"]

In [None]:
# filtering the dataframe to Southbank Sensor records

southbank_df = counts_df[counts_df["Sensor_Name"] == "Southbank"]

In [None]:
# filtering further to only Saturdays

southbank_sat_df = southbank_df[southbank_df["Day"] == "Saturday"]

In [None]:
# Total pedestrians through Southbank across all Saturdays

southbank_sat_df["Hourly_Counts"].sum()

In [None]:
# vectorised arithmetic

southbank_sat_df["Hourly_Counts"] + 1

**Key points:**

When working with Pandas, you are manipulating `DataFrame` objects and their columns, which are `Series` objects.

Try to use vectorised operations over these data structures rather than `for` loops.
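
As a small illustration on made-up numbers, the loop and the vectorised expression in this sketch compute the same thing, but the vectorised form is shorter and leaves the looping to Pandas:

```python
import pandas as pd

toy_counts = pd.Series([10, 20, 30], name="Hourly_Counts")  # made-up values

# Loop version: build the result one element at a time.
looped = []
for count in toy_counts:
    looped.append(count * 2)
looped = pd.Series(looped, name="Hourly_Counts")

# Vectorised version: one expression over the whole Series.
vectorised = toy_counts * 2

# Comparisons are vectorised too, which is what makes boolean filtering work.
busy_hours = toy_counts[toy_counts > 15]
```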

## Exploratory Data Analysis (EDA)

Dive into your data and get your hands dirty in order to:
* Identify data quality or integrity issues
* Understand applications the data supports (and does not support!)

Involves: 
* Extracting summary statistics of the data
* Visualising your data
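
For example, a group-and-summarise step on a tiny made-up table might look like this sketch. The rows are invented; the column names follow the counts dataset.

```python
import pandas as pd

# Invented rows, two per month (not real data).
toy_df = pd.DataFrame(
    {
        "Month": ["January", "January", "February", "February"],
        "Hourly_Counts": [100, 150, 300, 50],
    }
)

# Total count per month: the result is a Series indexed by month name.
monthly_totals = toy_df.groupby("Month")["Hourly_Counts"].sum()

# describe() reports count, mean, std, min, quartiles and max in one call.
summary = monthly_totals.describe()

# idxmax returns the index label of the largest value, not the value itself.
busiest_month = monthly_totals.idxmax()
```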

### Let's do some exploratory analysis on records from 2019

In [None]:
# filter down to 2019

counts_2019_df = counts_df[counts_df["Year"] == 2019]

In [None]:
# group by Month

months_2019 = counts_2019_df.groupby("Month")["Hourly_Counts"].sum()

In [None]:
# get various summary statistics

months_2019.describe()

In [None]:
# idxmax and idxmin are like max and min but return indexes

print(f"Busiest month is: {months_2019.idxmax()}")
print(f"Least busy month is: {months_2019.idxmin()}")

### Visual Exploration with Plotly Express

<center>
    <img src="img/python_viz_libs.svg" style="height:70vh"/>
</center>

In [None]:
months_2019

In [None]:
# turn the series into a DataFrame

months_2019_df = months_2019.reset_index()

In [None]:
import plotly.express as px

px.line(months_2019_df, x="Month", y="Hourly_Counts")

Problems:
1. It's alphabetically sorted
2. y-axis is truncated
3. Data points aren't easily identifiable
4. Plot is not well documented

Let's fix these!
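
For the first problem, the cells that follow sort with a key function. An alternative, sketched here on made-up totals, is to store the month names as an ordered categorical so that any later sort follows calendar order. This is a different technique from the one used in this notebook, and it assumes English month names (`calendar.month_name` depends on the locale).

```python
import calendar

import pandas as pd

# Invented totals in the alphabetical order that groupby produces.
toy_months_df = pd.DataFrame(
    {
        "Month": ["April", "February", "January", "March"],
        "Hourly_Counts": [40, 20, 10, 30],
    }
)

# calendar.month_name[0] is an empty string, so skip it.
month_order = list(calendar.month_name)[1:]

toy_months_df["Month"] = pd.Categorical(
    toy_months_df["Month"], categories=month_order, ordered=True
)

# Sorting a categorical column uses the category order, not the alphabet.
sorted_toy_df = toy_months_df.sort_values(by="Month")
sorted_names = sorted_toy_df["Month"].astype(str).tolist()
```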

In [None]:
# sort by calendar month rather than alphabetically

sorted_months_2019_df = months_2019_df.sort_values(
    by="Month",
    key=lambda series: pd.to_datetime(series, format="%B").dt.month,
)

In [None]:
px.line(sorted_months_2019_df, x="Month", y="Hourly_Counts")

In [None]:
# add markers to data points

figure = px.line(sorted_months_2019_df, x="Month", y="Hourly_Counts")
figure.update_traces(mode='lines+markers')

#### __*Hot Tip:*__ Always title your plots!

In [None]:
# Improve accessibility of the plot with a title and a better y-axis label

figure = px.line(
    sorted_months_2019_df, 
    x="Month", 
    y="Hourly_Counts", 
    title="Monthly Total Pedestrian Counts"
)
figure.update_traces(mode='lines+markers')
figure.update_layout(yaxis_title="Total Pedestrian Counts", title_x=0.5)

## Q1: _What does monthly traffic look like across years?_

We have a way to visualise monthly traffic over a single year.

Problems:
* It's spread over many notebook cells
* Changing the year is annoying


#### __*Hot Tip:*__ Move code into reusable functions!

In [None]:
def plot_months(counts_df, year):
    """Plot Monthly traffic for a given year."""
    
    # 1. Collect and shape data
    year_df = counts_df[counts_df["Year"] == year]
    months_df = year_df.groupby("Month")["Hourly_Counts"].sum().reset_index()
    sorted_months_df = months_df.sort_values(
        by="Month", 
        key=lambda x:pd.to_datetime(x, format="%B").dt.month
    )

    # 2. Make plot
    figure = px.line(
        sorted_months_df, 
        x="Month", 
        y="Hourly_Counts", 
        title="Monthly Total Pedestrian Counts"
    )
    
    # 3. Fine-tune plot's appearance 
    figure.update_traces(mode='lines+markers')
    figure.update_layout(yaxis_title="Total Pedestrian Counts", title_x=0.5)
    
    return figure

Note the common pattern when building plots:

1. Collect and shape data
2. Make plot
3. Fine-tune plot's appearance

In [None]:
plot_months(counts_df, 2020)

Problem: Answering questions about different years is still slower than we'd like.


### Using ipywidgets to make an interactive tool!
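
As a rough guide to what happens in the cells that follow, `interact` picks a widget based on the type of each keyword argument. This sketch uses `interactive`, which builds the same widgets without displaying them, so the mapping can be inspected. `show_year` is a hypothetical stand-in for a plotting function.

```python
from ipywidgets import Dropdown, IntSlider, interactive


def show_year(year):
    """Hypothetical stand-in for a plotting function."""
    return year


# A single integer becomes a slider that starts at that value.
slider_ui = interactive(show_year, year=2019)
slider_widget = slider_ui.children[0]

# An iterable of options, such as a range, becomes a dropdown.
dropdown_ui = interactive(show_year, year=range(2009, 2021))
dropdown_widget = dropdown_ui.children[0]
```

Arguments wrapped in `fixed(...)` are passed through unchanged and get no widget.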

In [None]:
from ipywidgets import interact, fixed

interact(plot_months, year=2019, counts_df=fixed(counts_df))

In [None]:
interact(plot_months, year=range(2009, 2021), counts_df=fixed(counts_df));

Another problem: a truncated y-axis is misleading when comparing years.

In [None]:
# add a truncate_y parameter to our function

def plot_months(counts_df, year, truncate_y=False):
    """Plot Monthly traffic for a given year."""
    
    # 1. Collect and shape data
    year_df = counts_df[counts_df["Year"] == year]
    months_df = year_df.groupby("Month")["Hourly_Counts"].sum().reset_index()
    sorted_months_df = months_df.sort_values(
        by="Month", 
        key=lambda x:pd.to_datetime(x, format="%B").dt.month
    )

    # 2. Make plot
    figure = px.line(
        sorted_months_df, 
        x="Month", 
        y="Hourly_Counts", 
        title="Monthly Total Pedestrian Counts"
    )
    
    # 3. Fine-tune plot's appearance 
    figure.update_traces(mode='lines+markers')
    figure.update_layout(yaxis_title="Total Pedestrian Counts", title_x=0.5)
    if not truncate_y:
        figure.update_layout(yaxis_rangemode='tozero')
    
    return figure

In [None]:
interact(plot_months, year=range(2009, 2021), counts_df=fixed(counts_df));

#### __*Hot Tip:*__ Share your interactive visualisation with ngrok and Voila

### Abstracting the filtering out of our plotting function
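
The core of the filtering in the function that follows is a boolean mask built with `Series.isin`. A minimal sketch on made-up rows (the counts are invented; the sensor names appear elsewhere in this notebook):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "Sensor_Name": ["Southbank", "Lygon St (East)", "Lygon St (West)"],
        "Year": [2019, 2020, 2020],
        "Hourly_Counts": [500, 120, 80],  # invented values
    }
)

# isin returns True for rows whose value is any of the given options.
lygon_mask = toy_df["Sensor_Name"].isin(["Lygon St (East)", "Lygon St (West)"])
lygon_df = toy_df[lygon_mask]

# Wrapping a single value in a list lets the same code path handle both cases.
year_df = toy_df[toy_df["Year"].isin([2019])]
```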

In [None]:
import collections.abc
import numbers

def filter_df(df, year=None, month=None, sensor=None):
    """Filter a pedestrian counts DataFrame.

    Each of year, month and sensor takes a single value or a sequence of
    values, and filters the input DataFrame to rows that match. (A sequence
    keeps rows matching *any* value in the sequence.)
    """
    params = {"Year": year, "Sensor_Name": sensor, "Month": month}
    for param, param_val in params.items():
        if param_val is None:
            continue
        elif is_value(param_val):
            param_val = [param_val]
        elif isinstance(param_val, collections.abc.Iterable):
            param_val = list(param_val)
        else:
            raise TypeError(
                f"Invalid value {param_val}, params must be str, numeric, or"
                " an iterable"
            )
        if len(param_val) == 0:
            continue
        df = df[df[param].isin(set(param_val))]
    return df


def is_value(obj):
    """Check if an object is a string or numeric value"""
    return isinstance(obj, (str, numbers.Number))

In [None]:
# get all records from 2020 for the following sensors:
sensors = ["Lygon St (East)", "Lygon St (West)", "Faraday St-Lygon St (West)"]
lygon_2020_df = filter_df(counts_df, year=2020, sensor=sensors)

### A different version

In [None]:
# second version by sensors with bar charts

def plot_month_counts(df, split_sensors=False, **kwargs):
    """Make a bar plot of monthly counts"""
    if split_sensors:
        group_cols = ["Month", "Sensor_Name"]
        color = "Sensor_Name"
    else:
        group_cols = ["Month"]
        color = None
    
    # 1. Collect and shape data
    month_df = (
        df.groupby(group_cols)["Hourly_Counts"]
        .sum()
        .reset_index()
        .sort_values(by="Month", key=lambda x: pd.to_datetime(x, format="%B").dt.month)
    )

    # 2. Make plot
    figure = px.bar(
        month_df,
        x="Month",
        y="Hourly_Counts",
        barmode="group",
        color=color,
        title="Monthly Sensor Traffic",
        **kwargs,
    )

    # 3. Fine-tune plot's appearance
    figure.update_layout(
        title_x=0.5,
        yaxis_title="Total Counts",
        yaxis_showgrid=False,
        yaxis_zeroline=False,
        xaxis_title=None,
        legend=dict(
            title_text="",
            orientation="h",
            yanchor="bottom",
            y=-0.6,
            xanchor="right",
            x=1,
        ),
    )
    return figure

In [None]:
plot_month_counts(lygon_2020_df, split_sensors=True)

In [None]:
sensors = ["Lygon St (East)", "Lygon St (West)", "Faraday St-Lygon St (West)", "Southbank"]
new_2020_df = filter_df(counts_df, year=2020, sensor=sensors)
plot_month_counts(new_2020_df, split_sensors=True)

## Q2: _What does daily traffic look like?_

In [None]:
def plot_sensor_traffic(
    df,
    same_yscale=False,
    row_height=150,
    limit=5,
    **kwargs,
):
    """Plot hourly traffic for one or more sensors"""
    
    # 1. Collect and shape data
    target_sensors = (
        df.groupby("Sensor_Name")["Hourly_Counts"].sum().sort_values(ascending=False)
    )[:limit]
    df = df[df["Sensor_Name"].isin(set(target_sensors.index))]

    # 2. Make plot
    if "height" not in kwargs:
        kwargs["height"] = max(len(target_sensors) * row_height, 400)
    figure = px.line(
        df,
        y="Hourly_Counts",
        x="Date_Time",
        facet_row="Sensor_Name",
        title="Hourly Pedestrian Traffic by Sensor",
        category_orders={"Sensor_Name": list(target_sensors.index)},
        **kwargs,
    )
    
    # 3. Fine-tune plot's appearance
    figure.update_layout(title_x=0.5)
    figure.update_yaxes(
        # share one y-scale across facets only when same_yscale is requested
        matches="y" if same_yscale else None,
        showgrid=False,
        zeroline=False,
        title_text=None,
    )
    figure.update_xaxes(showgrid=True, title_text=None)
    figure.for_each_annotation(
        lambda a: a.update(textangle=0, text=a.text.split("=")[-1])
    )
    return figure

In [None]:
counts_2020_df = filter_df(counts_df, year=2020, sensor="Southbank")
plot_sensor_traffic(counts_2020_df)
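
Inside `plot_sensor_traffic`, the `limit` busiest sensors are chosen by summing counts per sensor, sorting in descending order and slicing. Here is a sketch of that step on made-up rows, alongside `Series.nlargest` as an equivalent shorthand for this data. The sensor names `A`, `B` and `C` are hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "Sensor_Name": ["A", "B", "C", "A", "B", "C"],  # hypothetical sensors
        "Hourly_Counts": [5, 50, 20, 10, 40, 30],  # invented values
    }
)
limit = 2

totals = toy_df.groupby("Sensor_Name")["Hourly_Counts"].sum()

# Same approach as the function above: sort descending, keep the first `limit`.
top_by_sort = totals.sort_values(ascending=False)[:limit]

# nlargest gives the same sensors here without an explicit sort.
top_by_nlargest = totals.nlargest(limit)

# Keep only rows belonging to the selected sensors.
top_df = toy_df[toy_df["Sensor_Name"].isin(top_by_sort.index)]
```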

## Q3: _What are the most trafficked parts of Melbourne’s CBD?_

In [None]:
# load the sensors datasets

sensors_df = pd.read_csv(sensor_csv_path, index_col="sensor_id")

In [None]:
from melbviz.config import MAPBOX_KEY

px.set_mapbox_access_token(MAPBOX_KEY)


def plot_sensor_map(df, **kwargs):
    """Plot a spatial scatter plot of sensor traffic."""
    
    # 1. Collect and shape data
    sensor_totals_df = (
        df.groupby("Sensor_Name")
        .agg(
            {
                "Hourly_Counts": "sum",
                "latitude": lambda x: x.iloc[0],
                "longitude": lambda x: x.iloc[0],
            }
        )
        .reset_index()
        .rename(columns={"Hourly_Counts": "Total Counts"})
    )
    
    # 2. Make plot
    figure = px.scatter_mapbox(
        sensor_totals_df,
        lat="latitude",
        lon="longitude",
        color="Total Counts",
        size="Total Counts",
        text="Sensor_Name",
        color_continuous_scale=px.colors.sequential.Plasma,
        size_max=50,
        zoom=13,
        title="Sensor Traffic",
        **kwargs,
    )

    # 3. Fine-tune plot's appearance
    figure.update_layout(title_x=0.5)
    return figure

This function requires sensor lat/lon information, so we'll use an already joined dataset:

In [None]:
from melbviz.pedestrian import PedestrianDataset
from melbviz.config import DATA_PATH, COUNTS_CSV_PATH, SENSOR_CSV_PATH

data = PedestrianDataset.load(COUNTS_CSV_PATH, sensor_csv_path=SENSOR_CSV_PATH);

In [None]:
counts_2020_df = filter_df(counts_df, year=2020)
plot_sensor_map(data.filter(year=2020).df, height=800)
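
The joined dataset above comes from the `melbviz` package, whose loading code is not shown here. As a rough sketch only, a join of this kind could be done with `DataFrame.merge`. The `Sensor_ID` column on the counts side and all values in the sketch are assumptions for illustration; `sensor_id`, `latitude` and `longitude` are the names used elsewhere in this notebook.

```python
import pandas as pd

# Invented counts rows; the Sensor_ID column name is an assumption.
toy_counts_df = pd.DataFrame(
    {
        "Sensor_ID": [1, 1, 2],
        "Sensor_Name": ["Sensor A", "Sensor A", "Sensor B"],
        "Hourly_Counts": [100, 200, 50],
    }
)

# Invented sensor locations, indexed by sensor_id like sensors_df above.
toy_sensors_df = pd.DataFrame(
    {"latitude": [-37.82, -37.80], "longitude": [144.96, 144.97]},
    index=pd.Index([1, 2], name="sensor_id"),
)

# Left join: every count row keeps its data and gains the sensor's location.
joined_df = toy_counts_df.merge(
    toy_sensors_df, how="left", left_on="Sensor_ID", right_index=True
)
```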

### How do we share our interfaces?

#### Option 1: Voila + ngrok

#### Option 2: Dash and deploy somewhere

## Some Resources

### Getting Started with Jupyter Lab
I recommend the Anaconda Python distribution: [anaconda.com](https://www.anaconda.com)

<img src="img/anaconda_logo.png" width="300"/>



### Tools for Building Data Apps

<img src="img/data_app_libs.png" width="300"/>

See my talk surveying all of them: [youtu.be/jI5zLf9Hvd8](https://youtu.be/jI5zLf9Hvd8)