<img src="https://github.com/jupytercon/2020-exactlyallan/raw/master/images/RAPIDS-header-graphic.png">

# Exploratory Data Visualization
***Quickly finding linked patterns in your data***


## Overview

Taking the previous notebook’s vetted Divvy bike share dataset, we will now use cuDF, cuxfilter, and cuGraph to quickly create cross-filtered visualizations that explore different perspectives and slices of the data in search of interesting patterns.

### cuDF, cuxfilter, and cuGraph
- [cuDF](https://docs.rapids.ai/api/cudf/stable/) is a RAPIDS GPU DataFrame library for manipulating data with a pandas-like API.

- [cuxfilter](https://docs.rapids.ai/api/cuxfilter/nightly/) is a RAPIDS viz project. Focused on cross-filtering data, it is designed to quickly build linked dashboards powered by cuDF compute capabilities. cuxfilter acts as a connector library rather than a visualization library. It abstracts away all the 'plumbing' required to connect a [curated list of visualizations](https://docs.rapids.ai/api/cuxfilter/nightly/charts/charts.html) to a GPU dataframe. By enabling accelerated dashboards inline within a notebook workflow, cuxfilter lets analysts get to exploring their data faster.

- [cuGraph](https://docs.rapids.ai/api/cugraph/stable/) is a RAPIDS GPU accelerated graph analytics library with functionality like NetworkX.
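
To make the idea of cross-filtering concrete, here is a minimal pure-Python sketch. It is only an illustration of the concept and not how cuxfilter is implemented; the toy `trips` rows and the helper names `histogram` and `cross_filter` are hypothetical.

```python
from collections import Counter

# Hypothetical toy trips: (hour, day_type), where day_type 0 = weekday, 1 = weekend
trips = [(8, 0), (8, 0), (17, 0), (12, 1), (14, 1), (8, 1)]

HOUR, DAY_TYPE = 0, 1  # column positions in each row


def histogram(rows, column):
    """Count rows per distinct value of the given column index."""
    return Counter(row[column] for row in rows)


def cross_filter(rows, column, allowed):
    """Keep only rows whose value in `column` is in `allowed`."""
    return [row for row in rows if row[column] in allowed]


# Unfiltered view: the hour chart reflects the whole dataset
all_hours = histogram(trips, HOUR)

# Selecting "weekday" in one widget re-filters the data behind the hour chart
weekday_trips = cross_filter(trips, DAY_TYPE, {0})
weekday_hours = histogram(weekday_trips, HOUR)

print(dict(all_hours))
print(dict(weekday_hours))
```

In a linked dashboard, every chart plays both roles: a selection made on one chart acts as the filter, and all the other charts are recomputed from the filtered rows.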

### cuxfilter Examples
Learn about the detailed capabilities of cuxfilter in our [API Documentation](https://docs.rapids.ai/api/cuxfilter/stable/charts/charts.html) or click the examples below:

<br><br>

<img src="https://raw.githubusercontent.com/rapidsai/cuxfilter/branch-0.16/docs/_images/demo.gif" width="700" height="600" /> <br>
<p style="text-align: center">
    <a href="https://github.com/rapidsai/cuxfilter#example-1"> Example Dashboard 1</a>
</p>

<br><br>

<img src="https://raw.githubusercontent.com/rapidsai/cuxfilter/branch-0.16/docs/_images/demo2.gif" width="700" height="600" /><br>
<p style="text-align: center">
    <a href="https://github.com/rapidsai/cuxfilter#example-2">Example Dashboard 2</a>
</p>



## Imports
First, let's import the necessary libraries, including `Path` for setting the data location.

In [None]:
import cuxfilter
import cudf
import cugraph
from bokeh.models import NumeralTickFormatter
from pyproj import Proj, Transformer
from pathlib import Path

## Load Data into cuDF
As before, load `data.csv` into the GPU dataframe:

In [None]:
DATA_DIR = Path("../data")
FILENAME = Path("data.csv")

data = cudf.read_csv(DATA_DIR / FILENAME)

## Data Preprocessing
Before we can visualize the data, we need to do some preprocessing to make it more human readable and usable for cuxfilter.

First, we need to transform the x/y coordinates from their original [EPSG:4326 projection](https://epsg.io/4326) to the spherical [EPSG:3857 projection](https://epsg.io/3857) that works with the map tile underlays used in cuxfilter:

In [None]:
def transform_coords(df, x='x', y='y'):
    transform_4326_to_3857 = Transformer.from_crs('epsg:4326', 'epsg:3857')
    df['x'], df['y'] = transform_4326_to_3857.transform(df[x].to_array(), df[y].to_array())
    return df
# Apply Transformation
trips = transform_coords(data, x='latitude_start', y='longitude_start')
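
For intuition about what this projection does, the following is a minimal sketch of the standard spherical Web Mercator formula in plain Python. It is an illustration only, not a replacement for `pyproj`; the function name `lonlat_to_web_mercator` and the sample coordinates are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, the sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to EPSG:3857 x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Approximate downtown Chicago coordinates (illustrative values, not taken from the dataset)
x, y = lonlat_to_web_mercator(-87.63, 41.88)
print(round(x), round(y))
```

Note the argument order: this sketch takes longitude first, whereas the notebook passes latitude as `x`, since `pyproj` by default follows the EPSG:4326 axis order of latitude, then longitude.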

Based on our previous finding about the apparent difference between weekends and weekdays, we will want to extract `day_type` from the dataset:

In [None]:
# Note: days 0-4 are weekdays, days 5-6 are weekends
trips['day_type'] = 0
trips.loc[trips.query('day>4').index, 'day_type'] = 1

Choosing the appropriate fidelity of data to show always takes some trial and error. Showing total trips of every day for every year can be noisy, while showing by month is not granular enough. We settled on weeks. That means we will want to get the global week number in the dataset:

In [None]:
# Note: Data always has edge cases, such as the extra week anomalies of 2015 and 2016:
# trips.groupby('year').week.max().to_pandas().to_dict() is {2014: 52, 2015: 53, 2016: 53, 2017: 52}
# Since 2015 and 2016 have 53 weeks, we add 1 to global week count for their following years - 2016 & 2017
# (data.year/2016).astype('int') => returns 1 if year>=2016, else 0
year0 = int(trips.year.min()) #2014
trips['all_time_week'] = data.week + 52*(data.year - year0) + (data.year/2016).astype('int')
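
As a worked example, the same arithmetic can be checked on a few (year, week) pairs in plain Python. This is a sketch that mirrors the formula above for scalar values; the function name `all_time_week` and the sample pairs are chosen here for illustration.

```python
def all_time_week(year, week, year0=2014):
    """Mirror the notebook's formula for a running week index across years."""
    extra_week_offset = int(year / 2016)  # 1 for 2016 and later, else 0
    return week + 52 * (year - year0) + extra_week_offset


samples = [(2014, 1), (2014, 52), (2015, 1), (2015, 53), (2016, 1), (2017, 1)]
for year, week in samples:
    print(year, week, all_time_week(year, week))
```

Week 53 of 2015 maps to 105 and week 1 of 2016 maps to 106, so the index stays continuous across that boundary. Under the same formula, week 53 of 2016 and week 1 of 2017 both evaluate to 158, so it may be worth checking how those edge weeks are binned in your own data.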

To make the dashboard values more understandable, we are creating string maps to convert the dataset's numbers to their proper names. Though it may seem trivial, it removes unnecessary ambiguity and helps [reduce cognitive load](https://www.nngroup.com/articles/minimize-cognitive-load/) when our focus needs to be on finding patterns:

In [None]:
# create a weekday string map
days_of_week_map = {
    0: 'monday',
    1: 'tuesday',
    2: 'wednesday',
    3: 'thursday',
    4: 'friday',
    5: 'saturday',
    6: 'sunday'
}

month_map = {
    1: 'jan', 2: 'feb', 3: 'mar', 4: 'apr', 5: 'may', 6: 'jun', 7: 'jul', 8: 'aug', 9: 'sep', 10: 'oct', 11: 'nov', 12: 'dec'
}
day_type_map = {0:'weekday', 1:'weekend', '':'all'}
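
The sketch below shows, in plain Python, how numeric codes translate to labels through maps like these. The helper `describe` is hypothetical and exists only for this illustration; in the dashboards, the maps are simply passed to the chart constructors.

```python
# Same maps as above, repeated so this example is self-contained
days_of_week_map = {
    0: 'monday', 1: 'tuesday', 2: 'wednesday', 3: 'thursday',
    4: 'friday', 5: 'saturday', 6: 'sunday'
}
day_type_map = {0: 'weekday', 1: 'weekend', '': 'all'}


def describe(day):
    """Return the (day name, day type label) pair for a numeric day 0-6."""
    day_type = 1 if day > 4 else 0  # same rule used to build the day_type column
    return days_of_week_map[day], day_type_map[day_type]


for day in (0, 4, 5, 6):
    print(day, describe(day))
```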

Finally, we remove the unused columns and reorganize our dataframe:

In [None]:
trips = trips[[
    'year', 'month', 'week', 'day', 'hour', 'gender', 'from_station_name',
    'from_station_id', 'to_station_id', 'x', 'y', 'to_station_name', 'all_time_week', 'day_type'
]]
trips.head()

In [None]:
# Note: save modified trips dataframe to be imported in the final notebook
trips.to_parquet(DATA_DIR / 'modified_trips.parquet')

## cuxfilter Bike Trips Dashboard
First, let's investigate trip totals by various time slices by linking the dataframe to cuxfilter:

In [None]:
cux_df = cuxfilter.DataFrame.from_dataframe(data)

In [None]:
# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.bar('hour', title='trips per hour'),
    cuxfilter.charts.bar('month', x_label_map=month_map),
    cuxfilter.charts.bar('day', x_label_map=days_of_week_map),
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
]

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_double_base, title='Bike Trips Dashboard')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: use the slider below each chart to cross filter.

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
d.show(notebook_url='http://127.0.0.1:8000', service_proxy='jupyterhub')

## Bike Trips Findings
The dashboard should look something like this:

<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master//images/cuxfilter_02_dashboard_1.png" />

Some interesting points to note:
- The overall distribution of trips remains very consistent
- There is a clear pattern of weekday peaks around 7-9am and 4-6pm (commuters?)
- There is a clear pattern of a weekend peak around 10am-8pm (tourists?)
- Trips increase year over year and decrease substantially in the winter months, but the weekday commuter hours still bring the most trips


### Try It Out
Now try using [cuxfilter's user guide](https://docs.rapids.ai/api/cuxfilter/nightly/) and our examples to create a dashboard of the above data using different layouts, themes, and chart types.

In [None]:
# code here

## cuxfilter Temperature Dashboard
Let's continue investigating, this time following up on the increasing trips year over year and the decreases in winter months.

In [None]:
# Specify the charts and widgets to use with the selected columns of data and string maps
charts = [
    cuxfilter.charts.bar('all_time_week', title='rides per week'),
    cuxfilter.charts.heatmap(x='all_time_week', y='day', aggregate_col='temperature',
                             aggregate_fn='mean', point_size=40, legend_position='right',
                             title='mean temperature by day'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
]

# Generate the dashboard and select a layout
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_base, title='Temperature Dashboard')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: pan to match up the top and bottom chart axes

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
d.show(notebook_url='http://127.0.0.1:8000', service_proxy='jupyterhub')

## Weather Findings
The dashboard should look something like this:
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master//images/cuxfilter_02_dashboard_2.png" />

The weather's effect becomes clear in this dashboard: warmer temperatures seem to strongly match a large increase in ride counts, which intuitively makes sense. But aside from developing weather control, there isn't much that can be done to respond to this finding.


## cuxfilter Geospatial Trips Graph
Next, let's take a look at the geospatial element of the data and see if we can find interesting patterns. Based on how the trip data is logged, converting it into a graph will make managing it easier.

For this we will need [cuGraph](https://docs.rapids.ai/api/cugraph/stable/api.html) to translate the dataset into an edge list:

In [None]:
G = cugraph.Graph() 
G.from_cudf_edgelist(data, source='from_station_id', destination='to_station_id')
edges = G.edges()

In [None]:
# Trips have been converted into edges with source and destination based on station IDs.
edges.head()
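
The idea behind this conversion can be sketched in plain Python: many individual trips collapse into a smaller set of distinct station-to-station edges. This is only a conceptual illustration with hypothetical toy data; how cuGraph handles duplicates and edge direction internally may differ and can depend on the version.

```python
from collections import Counter

# Hypothetical toy trips as (from_station_id, to_station_id) pairs
trips = [(1, 2), (1, 2), (2, 3), (3, 1), (1, 2), (2, 3)]

# Each distinct (source, destination) pair becomes one edge;
# the number of trips along it can be kept as a weight
edge_weights = Counter(trips)
edges = sorted(edge_weights)

# Trips leaving each station, comparable to a 'count' aggregate per node
departures = Counter(src for src, _ in trips)

print(edges)
print(dict(edge_weights))
print(dict(departures))
```

Here six trips reduce to three edges, which is why working with the edge list is more manageable than working with the raw trip rows.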

Next we load the formatted data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((trips, edges))

In [None]:
# Specifying a graph chart type will use Datashader and its required parameters
charts = [
    cuxfilter.charts.graph(
        node_id='from_station_id',
        edge_source='src', edge_target='dst',
        node_aggregate_fn='count',
        node_pixel_shade_type='linear', node_point_size=35, #node size is fixed
        edge_render_type='curved', #other option: direct
        edge_transparency=0.7, #0.1 - 0.9
        tile_provider='CARTODBPOSITRON', 
        title='Graph for trip source_stations (color by count)'
    ),
    cuxfilter.charts.multi_select('year'),
    cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
    cuxfilter.charts.bar('from_station_id'),
    cuxfilter.charts.bar('to_station_id'),
    cuxfilter.charts.view_dataframe(['from_station_name', 'from_station_id'], drop_duplicates=True)
]

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.feature_and_triple_base, theme=cuxfilter.themes.rapids, title='Geospatial Trips')

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
d.show(notebook_url='http://127.0.0.1:8000', environment='jupyterhub')

## Geospatial Findings
The dashboard should look something like this:
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master//images/cuxfilter_02_dashboard_3.png" />

Overall there are many patterns of interest:
- There is overall high bike network utilization
- A smaller core region of the network accounts for a majority of the trips
- Most of these core trips seem to relate to the weekday commuters
- The weekend trips are more spread out along the coast
- The older parts of the network start in the core and radiate outward through the years


## cuxfilter Network and Geospatial Graph
While the above produced many findings, filtering through so many trip edges is not ideal.
Next we will try to push the visual analytics further with a clustered network graph alongside the geospatial graph using the [ForceAtlas2](https://docs.rapids.ai/api/cugraph/stable/api.html?highlight=force#module-cugraph.layout.force_atlas2) algorithm from cuGraph:

In [None]:
# Note: Often a good visualization result only comes from a lot of trial and error
# The below parameters produce useful clustering, but try experimenting with them further
ITERATIONS=500
THETA=10.0
OPTIMIZE=True

# Using the previously created edge list, we calculate the FA2 layout positions here
trips_force_atlas2_layout = cugraph.layout.force_atlas2(G, max_iter=ITERATIONS,
                strong_gravity_mode=False,
                outbound_attraction_distribution=True,
                lin_log_mode=False,
                barnes_hut_optimize=OPTIMIZE, barnes_hut_theta=THETA, verbose=True)

Merge the calculated ForceAtlas2 layout with the trip dataframe:

In [None]:
final_df = trips_force_atlas2_layout.merge(
                trips[['from_station_id', 'from_station_name','to_station_id', 'year', 'hour', 'day_type', 'x', 'y']],
                left_on='vertex',
                right_on='from_station_id',
                suffixes=('', '_original')
)

# Preview
final_df.head()

Next we load the data into cuxfilter and specify the chart types:

In [None]:
cux_df = cuxfilter.DataFrame.load_graph((final_df, edges))

In [None]:
# Both scatter and graph chart types use Datashader 
charts= [
  cuxfilter.charts.graph(
      edge_source='src', edge_target='dst',
      edge_color_palette=['gray', 'black'],
      node_pixel_shade_type='linear',
      edge_render_type='curved', #other option: direct
      edge_transparency=0.7, #0.1 - 0.9
      title='ForceAtlas2 Layout Graph'
  ),
  cuxfilter.charts.scatter(
    x='x_original', y='y_original', 
    tile_provider='CARTODBPOSITRON',
    point_size=3,
    pixel_shade_type='linear',
    pixel_spread='spread',
    title='Original Layout'
  ),
  cuxfilter.charts.multi_select('year'),
  cuxfilter.charts.multi_select('day_type', label_map=day_type_map),
  cuxfilter.charts.bar('hour', title='Trips per hour'),
  cuxfilter.charts.bar('from_station_id', title='Source station'),
  cuxfilter.charts.bar('to_station_id', title='Destination station'),
  cuxfilter.charts.view_dataframe(['from_station_id', 'from_station_name'], drop_duplicates=True)
] 

# Generate the dashboard, select a layout and theme
d = cux_df.dashboard(charts, layout=cuxfilter.layouts.double_feature_quad_base, theme=cuxfilter.themes.rapids, title="Network and Geospatial Graph")

# Update the yaxis ticker to an easily readable format
for i in charts:
    if hasattr(i.chart, 'yaxis'):
        i.chart.yaxis.formatter = NumeralTickFormatter(format="0,0")

In [None]:
# Start the dashboard, a green button should appear to open one in a new tab.
# Note: Graph edges can be turned on/off via the line tool icon
# Note: Inspect Neighboring Edges can be turned on/off for box or lasso select
# Caution: Selecting areas with Inspect Neighboring Edges on can result in slow performance or OOM errors  
# Caution: If the dashboard freezes, simply close the tab and restart this cell
# Note: This is rendering 9 MILLION edges

# IMPORTANT: replace notebook_url with your jupyterhub/binder base url
d.show(notebook_url='http://127.0.0.1:8000', environment='jupyterhub')

## Network and Geospatial Findings
The dashboard should look something like this:
<img src="https://raw.githubusercontent.com/jupytercon/2020-exactlyallan/master/images/cuxfilter_02_dashboard_4.png" />

Running the FA2 algorithm to group the station nodes together in a graph and placing the geospatial chart alongside it provided some compelling findings:
- Stations form clusters of connectivity that are clearly geographically distinct 
- The core weekday group is actually multiple distinct clusters in close proximity (different work districts?)
- The weekday group stays focused until after work hours where they then disperse north (happy hour?)
- The weekend group is overall more spread out, starting along the coast then dispersing throughout the city towards the evening (sightseeing?)
- Theater on Lake Station is a hyper focal point for the weekend group

These are only a few notable points found relatively quickly - there are certainly more patterns.


## Summary of Exploratory Findings
Based on the exploratory analytics done above, we've found that there are two distinct groups of behaviors based on time (hour / weekend / weekday) and location. With the next notebook, we will see if we can coax out further information about these groups using more advanced data analytics.

### cuxfilter Troubleshooting
As we just released the graph visualization capability in cuxfilter, we are still working on building out features and fixes. 

If you find something that needs fixing or have feature requests, please submit an [issue on our GitHub page](https://github.com/rapidsai/cuxfilter/issues). Better yet, [help contribute](https://github.com/rapidsai/cuxfilter#contributing-developers-guide).