<a href='http://bokeh.pydata.org'><img src="assets/bokeh_logo.svg" alt="Bokeh logo" width="4%;" align="right"/></a>

<a href='http://www.holoviews.org'><img src="assets/header_logo.png" alt="HoloViews logo" width="20%;" align="left"/></a>
<div style="float:right;"><h2>07. Working with large datasets</h2></div>

HoloViews makes it easy to work with even high-dimensional datasets. On their own, however, its standard data interfaces and plotting extensions can quickly reach their limits with very large datasets. When working with datasets of millions or even billions of datapoints, HoloViews can flexibly extend to work with [``dask``](http://dask.pydata.org/en/latest/) dataframes, which allow for out-of-core computation and parallel processing of large tabular datasets. Additionally, HoloViews provides an interface for [``datashader``](http://datashader.readthedocs.io/en/latest/), allowing you to quickly visualize large datasets.

The datashader library is designed to complement standard plotting libraries by providing visualizations for very large datasets, focusing on faithfully revealing the overall distribution, not just individual data points.

Dask DataFrame provides an API that closely mirrors pandas but allows working with data out of core and scales out to many processors and even clusters. Here we will use it to load a large CSV file of taxi coordinates.

<div >
    <img align="left" src="./assets/numba.png" width='140px'></img>
<img align="left" src="./assets/dask.svg" width='115px'></img>
<img align="left" src="./assets/datashader.png" width='158px'></img>
<img align="left" src="./assets/holoviews.png" width='140px'></img>
</div>

### How does datashader work?

* Tools like Bokeh map Data directly into an HTML/JavaScript Plot
* datashader renders Data into a screen-sized Aggregate array, from which an Image can be constructed then embedded into a Bokeh Plot
* Only the fixed-sized Image needs to be sent to the browser, allowing millions or billions of datapoints to be used
* Every step automatically adjusts to the data, but can be customized

<img src="./assets/datashader_pipeline.png"></img>
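
To make the aggregation step concrete, here is a minimal pure-Python sketch of the idea: points are binned into a fixed-size grid of counts, and the counts are then mapped to pixel intensities. The function names `aggregate_counts` and `shade` are hypothetical and used only for illustration. This is not datashader's implementation, and the linear grayscale mapping is a deliberate simplification of its shading step.

```python
def aggregate_counts(points, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid."""
    x0, x1 = x_range
    y0, y1 = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Points outside the visible range do not contribute to the image
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        grid[row][col] += 1
    return grid


def shade(grid):
    """Map counts linearly onto 0-255 grayscale intensities (illustrative only)."""
    peak = max(max(row) for row in grid)
    if peak == 0:
        return [[0] * len(row) for row in grid]
    return [[255 * count // peak for count in row] for row in grid]


# Five sample points; the last one lies outside the visible range
points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.8), (0.5, 0.5), (2.0, 2.0)]
counts = aggregate_counts(points, x_range=(0, 1), y_range=(0, 1), width=4, height=2)
image = shade(counts)
```

However many points are passed in, `counts` and `image` always have `width * height` cells, which is why only a fixed-size image ever needs to reach the browser. Zooming corresponds to calling the aggregation again with a narrower `x_range` and `y_range`.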

#### When not to use datashader

* Plotting fewer than 1e5 or 1e6 data points
* When every datapoint matters; standard Bokeh will render all of them
* For full interactivity (hover tools) with every datapoint

#### When to use datashader

* Actual big data; when Bokeh/Matplotlib have trouble
* When the distribution matters more than individual points
* When you find yourself sampling or binning to better understand the distribution

In [None]:
import pandas as pd
import holoviews as hv
import dask.dataframe as dd
import datashader as ds
import geoviews as gv

from holoviews.operation.datashader import datashade, aggregate
from holoviews.plotting.util import fire
datashade.cmap = fire
hv.extension('bokeh')

## Load the data

As a first step we will load a large dataset using dask. If you have followed the setup instructions you will have downloaded a large CSV containing 12 million taxi trips, which we will load using dask, creating a dask dataframe that provides an API replicating pandas:

In [None]:
ddf = dd.read_csv('../data/nyc_taxi.csv', parse_dates=['tpep_pickup_datetime'])
ddf['hour'] = ddf.tpep_pickup_datetime.dt.hour

# If your machine is low on RAM (<8GB) don't persist (everything will be much slower)
ddf = ddf.persist()
print('%s Rows' % len(ddf))
print('Columns:', list(ddf.columns))

## Create a dataset

In previous sections we have already seen how to declare a set of [``Points``](http://holoviews.org/reference/elements/bokeh/Points.html) from a DataFrame. This works much the same way: we pass in the DataFrame along with the key dimensions. Remember, however, that we have 12 million rows of data, which no standard plotting program will handle well! We will therefore use the ``datashade`` operation, which aggregates the data on a 2D grid and then applies shading, leaving us with an ``RGB`` Element to display:

In [None]:
%opts RGB [width=800 height=400]
points = hv.Points(ddf, kdims=['dropoff_x', 'dropoff_y'])
datashade(points)

If you zoom in you will notice that the plot rerenders depending on the zoom level. This is because the ``datashade`` operation is a dynamic operation that also declares some linked streams. These are automatically instantiated and supply the ``x_range`` and ``y_range`` to the operation, which change dynamically as you zoom in.

In [None]:
datashade.streams

In [8]:
# Exercise: Plot the taxi pickup locations 
# Hint: The dataframe contains 'pickup_x' and 'pickup_y' columns
# Optional: Change the cmap on the datashade operation to inferno

from colorcet import inferno

## Adding a tile source

Using the GeoViews extension for HoloViews we can display a tile source in the background. We declare a Bokeh ``WMTSTileSource``, pass it to the ``gv.WMTS`` Element, and then overlay the datashaded points on top of it:

In [None]:
%opts RGB [xaxis=None yaxis=None]
import geoviews as gv
from bokeh.models import WMTSTileSource
url = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'
wmts = WMTSTileSource(url=url)
gv.WMTS(wmts) * datashade(points)

In [11]:
# Exercise: Overlay the taxi pickup data on top of the Wikipedia tile source

wiki_url = 'https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'

## Aggregating with a variable

So far we have simply been counting taxi dropoffs, but our dataset is much richer than that. We have information about a number of variables including the total cost of a taxi ride, the ``total_amount``. Datashader provides a number of ``aggregator`` functions, which you can supply to the ``datashade`` operation. Here we use the ``ds.mean`` aggregator to compute the average cost of a trip at each dropoff location:

In [None]:
selected = points.select(total_amount=(None, 1000))
selected.data = selected.data.persist()
gv.WMTS(wmts) * datashade(selected, aggregator=ds.mean('total_amount'))
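
Conceptually, a mean aggregator keeps a running sum and a count for each grid cell and divides the two at the end. The following pure-Python sketch illustrates that idea under simplifying assumptions. The function name `aggregate_mean` and the sample `trips` values are hypothetical, and marking empty cells as NaN is a choice made for this sketch rather than a statement about datashader's internals.

```python
import math


def aggregate_mean(rows, x_range, y_range, width, height):
    """Average the third value of each (x, y, value) row per grid cell."""
    x0, x1 = x_range
    y0, y1 = y_range
    totals = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, value in rows:
        # Ignore rows that fall outside the visible range
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue
        col = int((x - x0) / (x1 - x0) * width)
        row = int((y - y0) / (y1 - y0) * height)
        totals[row][col] += value
        counts[row][col] += 1
    # Cells without any rows have no defined mean, so mark them as NaN
    return [
        [total / count if count else math.nan for total, count in zip(trow, crow)]
        for trow, crow in zip(totals, counts)
    ]


# Made-up (x, y, total_amount) rows, purely for illustration
trips = [(0.1, 0.1, 10.0), (0.2, 0.3, 20.0), (0.8, 0.9, 7.5)]
means = aggregate_mean(trips, x_range=(0, 1), y_range=(0, 1), width=2, height=2)
```

The two trips that land in the same cell are averaged, while cells with no trips carry no value. This is also why filtering outliers with ``select`` before aggregating matters: a single extreme value can dominate the mean of a sparsely populated cell.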

In [15]:
# Exercise: Use the ds.min or ds.max aggregator to visualize tipping amounts by dropoff location
# Hint: The tipping amounts are given by the ``tip_amount`` column
# Optional: Eliminate outliers by using select

## Grouping by a variable

In [None]:
%opts Image [width=600 height=300 logz=True xaxis=None yaxis=None]
taxi_ds = hv.Dataset(ddf)
grouped = taxi_ds.to(hv.Points, ['dropoff_x', 'dropoff_y'], groupby=['hour'], dynamic=True)
aggregate(grouped).redim.values(hour=range(24))

In [17]:
# Exercise: Facet the trips in the morning hours as an NdLayout 
# Hint: You can reuse the grouped variable or select a subset before using the .to method

## Additional features

We can overlay an invisible ``QuadMesh`` to reveal information on hover.

In [None]:
%%opts QuadMesh [width=800 height=400 tools=['hover']] (alpha=0 hover_line_alpha=1 hover_fill_alpha=0)
hover_info = aggregate(points, width=40, height=20, streams=[hv.streams.RangeXY]).map(hv.QuadMesh, hv.Image)
gv.WMTS(wmts) * datashade(points) * hover_info

## Read more:

* Read the user guide on [Working with large data using datashader](http://holoviews.org/user_guide/Large_Data.html)
* See a [bokeh app](http://holoviews.org/gallery/apps/bokeh/nytaxi_hover.html) using this dataset and an additional linked stream.