# Using Datashader with Bokeh in HoloViews

[HoloViews](http://holoviews.org) (1.7 and later) is a high-level data analysis and visualization library that makes it simple to generate interactive [Datashader](https://github.com/bokeh/datashader)-based plots in [Bokeh](http://bokeh.pydata.org).  Using these three tools together, it is simple to work with billions or more points interactively in a web browser:

  ![Datashader+Holoviews+Bokeh](https://github.com/bokeh/datashader/raw/master/docs/images/ds_hv_bokeh.png)

In short, HoloViews makes it simple to switch between Datashader and regular plots when creating interactive web visualizations.  A developer willing to do more programming can do all the same things separately, using Bokeh and Datashader's APIs directly, but with HoloViews it is much simpler to explore and analyze data.  You can also use Datashader without either Bokeh or HoloViews (the light gray lines above), but then you'll need to do a *lot* more work yourself.

To see how these tools work together, let's define some synthetic data to plot:

In [None]:
import numpy as np
import holoviews as hv
import datashader as ds
from holoviews.operation.datashader import aggregate, datashade, dynspread, shade
from holoviews.operation import decimate
hv.notebook_extension('bokeh')
decimate.max_samples=1000
dynspread.max_px=20
dynspread.threshold=0.5
shade.cmap="#30a2da" # to match HV Bokeh default

def random_walk(n, f=5000):
    """Random walk in a 2D space, smoothed with a filter of length f"""
    xs = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    ys = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    xs += 0.1*np.sin(0.1*np.arange(n-1+f)) # add wobble on x axis
    xs += np.random.normal(0, 0.005, size=n-1+f) # add measurement noise
    ys += np.random.normal(0, 0.005, size=n-1+f)
    return np.column_stack([xs, ys])

def random_cov():
    """Random covariance for use in generating 2D Gaussian distributions"""
    A = np.random.randn(2,2)
    return np.dot(A, A.T)

def time_series(T = 1, N = 100, mu = 0.1, sigma = 0.1, S0 = 20):  
    """Parameterized noisy time series"""
    dt = float(T)/N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size = N) 
    W = np.cumsum(W)*np.sqrt(dt) # standard brownian motion
    X = (mu-0.5*sigma**2)*t + sigma*W 
    S = S0*np.exp(X) # geometric brownian motion
    return S

# HoloViews Elements

Rather than starting out by specifying a figure or plot, in HoloViews you specify an "Element" object to contain your data, such as `Points` (scatterplots) or `Path` (trajectories).  Even though these objects are fundamentally data containers, not visualizations, if you ask for their representation in a Jupyter notebook a corresponding Bokeh plot will be created (see the ["rich display" notebook](https://anaconda.org/jbednar/rich_display) for more details):

In [None]:
np.random.seed(1)
positions = np.random.multivariate_normal((0,0), [[0.1,0.1], [0.1,1.0]], (1000,))

points = hv.Points(positions,label="Points")
paths  = hv.Path([random_walk(2000,30)], label="Paths")

points + paths

These browser-based plots are fully interactive, as you can see if you select the Wheel Zoom or Box Zoom tools and use your scroll wheel or click and drag.  

Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page, such as on anaconda.org.  However, this flexibility comes at the cost of being unable to handle larger datasets, whose data will quickly overwhelm the browser and cause slowdowns or crashes after a few tens or hundreds of thousands of data points.

Moreover, even with just 1000 points as in the scatterplot above, the plot already suffers from [overplotting](https://anaconda.org/jbednar/plotting_pitfalls), with later points obscuring previously plotted points.  With much larger datasets, these issues will quickly make it impossible to see the true structure of the data.  


# Datashader operations

If we tried to visualize the two HoloViews Elements below, which are just larger versions of the same data above, the plots would be nearly unusable even if the browser did not crash:

In [None]:
np.random.seed(1)
positions = np.random.multivariate_normal((0,0), [[0.1,0.1], [0.1,1.0]], (1000000,))

points = hv.Points(positions,label="Points")
paths  = hv.Path([0.15*random_walk(100000) for i in range(10)],label="Paths")

#points + paths  ## Danger! Browsers can't handle 1 million points!

Luckily, because HoloViews Elements are just containers for data and associated metadata, not plots, HoloViews can generate entirely different types of visualizations from the same data structure when appropriate.  For instance, in the plot on the left below you can see the result of adding a `decimate()` operator acting on the `points` object, which will automatically downsample this million-point dataset to at most 1000 points at any time as you zoom in or out:

In [None]:
decimate(points) + datashade(points) + datashade(paths)
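
The downsampling idea behind `decimate()` can be sketched in plain NumPy. This is only an illustration under stated assumptions, not HoloViews' implementation: the function name `decimate_points`, its `seed` argument, and the use of uniform random sampling are all hypothetical choices made for the example.

```python
import numpy as np

def decimate_points(xy, x_range, y_range, max_samples=1000, seed=42):
    """Keep at most max_samples of the points visible in the given viewport."""
    x, y = xy[:, 0], xy[:, 1]
    visible = xy[(x >= x_range[0]) & (x <= x_range[1]) &
                 (y >= y_range[0]) & (y <= y_range[1])]
    if len(visible) <= max_samples:
        return visible  # few enough points: show them all
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(visible), size=max_samples, replace=False)
    return visible[keep]

rng = np.random.default_rng(1)
demo_points = rng.multivariate_normal((0, 0), [[0.1, 0.1], [0.1, 1.0]], 100_000)

# Zoomed out, most points are discarded; zoomed in far enough, none are.
zoomed_out = decimate_points(demo_points, (-2, 2), (-4, 4))
zoomed_in = decimate_points(demo_points, (-0.01, 0.01), (-0.01, 0.01))
```

Because the sample is redrawn for each viewport, zooming in reveals points that were dropped at the wider view, but at any one zoom level the plot still shows at most `max_samples` markers.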

Decimating a plot in this way can be useful, but it discards most of the data, yet still suffers from overplotting. If you have Datashader installed, you can instead use the `datashade()` operation from HoloViews to create a dynamic Datashader-based Bokeh plot.  (Here `datashade()` is just a convenient shortcut for the two main steps in data shading, i.e., `shade(aggregate())`, which can also be invoked separately.) The middle plot above shows the result of using `datashade()` to create a dynamic Datashader-based plot out of an Element with arbitrarily large data.  In the Datashader version, a new image is regenerated automatically on every zoom or pan event, revealing all the data available at that zoom level and avoiding issues with overplotting.
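
The two steps named above, aggregation followed by shading, can be approximated with NumPy alone. The sketch below is a simplified stand-in rather than Datashader's code: `aggregate_counts` and `shade_counts` are hypothetical names, and the logarithmic opacity mapping is an assumption chosen for brevity.

```python
import numpy as np

def aggregate_counts(xy, x_range, y_range, width=400, height=400):
    """Count the points falling into each pixel of a height x width grid."""
    counts, _, _ = np.histogram2d(
        xy[:, 1], xy[:, 0],                 # rows follow y, columns follow x
        bins=(height, width), range=(y_range, x_range))
    return counts                           # row 0 holds the lowest y values

def shade_counts(counts, rgb=(0x30, 0xa2, 0xda)):
    """Turn counts into an RGBA image whose opacity grows with log(count)."""
    image = np.zeros(counts.shape + (4,), dtype=np.uint8)
    image[..., :3] = rgb
    peak = counts.max()
    if peak > 0:
        alpha = np.log1p(counts) / np.log1p(peak)
        # ceil keeps every non-empty pixel at least faintly visible
        image[..., 3] = np.ceil(255 * alpha).astype(np.uint8)
    return image

rng = np.random.default_rng(1)
demo_points = rng.multivariate_normal((0, 0), [[0.1, 0.1], [0.1, 1.0]], 100_000)

# Every zoom or pan would call this pair again with the new ranges.
demo_image = shade_counts(aggregate_counts(demo_points, (-2, 2), (-4, 4)))
```

The image size depends only on `width` and `height`, never on the number of points, which is why the amount of data sent to the browser stays constant however large the dataset becomes.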

These two Datashader-based plots are similar to the native Bokeh plots above, but instead of making a static Bokeh plot that embeds points or line segments directly into the browser, HoloViews sets up a Bokeh plot with dynamic callbacks that render the data as an RGB image using Datashader instead.  The dynamic re-rendering provides an interactive user experience even though the data itself is never provided directly into the browser.  Of course, because the full data is not in the browser, a static export of this page (e.g. on anaconda.org) will only show the initially rendered version, and will not update with new images when zooming as it will when there is a live Python process available.

Though you can no longer have a completely interactive exported file, with the Datashader version on a live server you can now change 1000000 to 10000000 or more to see how well your machine will handle larger datasets. It will get a bit slower, but if you have enough memory, it should still be very usable, and should never crash your browser as transferring the whole dataset into your browser would.  If you don't have enough memory, you can instead set up a [Dask](http://dask.pydata.org) dataframe as shown in other Datashader examples, which will provide out-of-core and/or distributed processing to handle even the largest datasets.
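
Out-of-core processing suits this kind of plot because per-pixel counts are additive: the grid for the whole dataset is the sum of the grids for its parts. The sketch below demonstrates that property with plain NumPy standing in for Dask; the helper names `count_grid`, `count_grid_chunked`, and `generate_chunks` are hypothetical.

```python
import numpy as np

def count_grid(xy, x_range, y_range, width, height):
    """Per-pixel point counts for one in-memory block of points."""
    counts, _, _ = np.histogram2d(
        xy[:, 1], xy[:, 0], bins=(height, width), range=(y_range, x_range))
    return counts

def count_grid_chunked(chunks, x_range, y_range, width=400, height=400):
    """Aggregate an iterable of chunks, holding only one chunk at a time."""
    total = np.zeros((height, width))
    for chunk in chunks:
        total += count_grid(chunk, x_range, y_range, width, height)
    return total

def generate_chunks(n_chunks, chunk_size, seed=1):
    """Yield blocks of synthetic points, as if read from disk piece by piece."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        yield rng.multivariate_normal((0, 0), [[0.1, 0.1], [0.1, 1.0]], chunk_size)

streamed = count_grid_chunked(generate_chunks(5, 20_000), (-2, 2), (-4, 4))
```

The same additivity is what allows the work to be split across several machines: each worker aggregates its own partition and only the small grids need to be combined.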

## Spreading



The Datashader examples above treat points and lines as infinitesimal in width, such that a given point or small bit of line segment appears in at most one pixel. This approach ensures that the overall distribution of the points will be mathematically well founded -- each pixel will scale in value directly by the number of points that fall into it, or by the lines that cross it.

However, many monitors are sufficiently high resolution that the resulting point or line can be difficult to see: a single pixel may not actually be visible on its own, and its color is likely to be very difficult to make out.  To compensate for this issue, HoloViews provides access to Datashader's image-based "spreading", which makes isolated pixels "spread" into adjacent ones for visibility.  Because the amount of spreading that's useful depends on how close the datapoints are to each other on screen, the most useful such function is `dynspread`, which spreads up to a maximum size (`max_px`) as long as the result does not exceed a specified fraction of adjacency between pixels (`threshold`).  You can compare the results in the two plots below after zooming in:

In [None]:
datashade(points) + dynspread(datashade(points))

Both plots show the same data, and look identical when zoomed out, but when zoomed in enough you should be able to see the individual data points on the right while the ones on the left are barely visible.  The dynspread parameters typically need some hand tuning, as the only purpose of such spreading is to make things visible on a particular monitor for a particular observer; the underlying mathematical operations in Datashader do not normally need parameters to be adjusted.
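
To make the roles of `max_px` and `threshold` more concrete, here is a simplified sketch of density-limited spreading on a boolean pixel mask. It is loosely modeled on the idea described above and is not Datashader's implementation: the square neighborhood, the exact density heuristic, and the names `spread`, `density`, and `dyn_spread` are assumptions made for illustration.

```python
import numpy as np

def spread(mask, px=1):
    """Set every pixel within px (square neighborhood) of an already-set pixel."""
    height, width = mask.shape
    padded = np.pad(mask, px)
    out = np.zeros_like(mask)
    for dy in range(2 * px + 1):
        for dx in range(2 * px + 1):
            out |= padded[dy:dy + height, dx:dx + width]
    return out

def density(mask):
    """Mean number of set 8-neighbors per set pixel, normalized to [0, 1]."""
    n_set = mask.sum()
    if n_set == 0:
        return np.inf  # nothing to spread
    height, width = mask.shape
    padded = np.pad(mask, 1).astype(int)
    neighbors = sum(padded[dy:dy + height, dx:dx + width]
                    for dy in range(3) for dx in range(3)) - mask.astype(int)
    return neighbors[mask].sum() / (8 * n_set)

def dyn_spread(mask, max_px=3, threshold=0.5):
    """Spread by the smallest px whose result is dense enough, up to max_px."""
    out = mask
    for px in range(max_px + 1):
        out = spread(mask, px)
        if density(out) >= threshold:
            break
    return out

isolated = np.zeros((21, 21), dtype=bool)
isolated[10, 10] = True
visible = dyn_spread(isolated)   # the lone pixel grows into a small block
```

In this sketch an already dense image is returned unchanged, while sparse images are enlarged just enough to become visible, which matches the behavior seen when zooming in and out of the plots above.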

The same operation works similarly for line segments:

In [None]:
datashade(paths) + dynspread(datashade(paths))

# Multidimensional plots

The above plots show two dimensions of data plotted along *x* and *y*, but Datashader operations can be used with additional dimensions as well.  For instance, an extra dimension (here called `k`), can be treated as a category label and used to colorize the points or lines.  Compared to a standard scatterplot that would suffer from overplotting, here the result will be merged mathematically by Datashader, completely avoiding any overplotting issues except local ones due to spreading:

In [None]:
%%opts RGB [width=400] {+axiswise}
from datashader.colors import Sets1to3 # default datashade() and shade() color cycle

np.random.seed(3)
kdims=['d1','d2']
num_ks=8

def rand_gauss2d():
    return 100*np.random.multivariate_normal(np.random.randn(2), random_cov(), (100000,))

gaussians = {i: hv.Points(rand_gauss2d(), kdims=kdims) for i in range(num_ks)}
lines = {i: hv.Curve(time_series(N=10000, S0=200+np.random.rand())) for i in range(num_ks)}

gaussspread = dynspread(datashade(hv.NdOverlay(gaussians, kdims=['k']), aggregator=ds.count_cat('k'), cmap=Sets1to3))
linespread  = dynspread(datashade(hv.NdOverlay(lines,     kdims=['k']), aggregator=ds.count_cat('k'), cmap=Sets1to3))

gaussspread + linespread
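
The way a categorical aggregate avoids overplotting can be illustrated with a small NumPy sketch: counts are kept separately per category, and each pixel's color is a count-weighted blend of the category colors. This is an assumption-laden simplification, not Datashader's algorithm, and `count_by_category` and `mix_colors` are hypothetical names.

```python
import numpy as np

def count_by_category(xy, cats, n_cats, x_range, y_range, width, height):
    """Return per-category pixel counts with shape (n_cats, height, width)."""
    grids = []
    for k in range(n_cats):
        sel = xy[cats == k]
        counts, _, _ = np.histogram2d(
            sel[:, 1], sel[:, 0], bins=(height, width), range=(y_range, x_range))
        grids.append(counts)
    return np.stack(grids)

def mix_colors(counts, palette):
    """Blend category colors per pixel, weighted by each category's count."""
    total = counts.sum(axis=0)
    weights = counts / np.where(total > 0, total, 1)   # avoid division by zero
    rgb = np.tensordot(weights, palette, axes=([0], [0]))  # (height, width, 3)
    return rgb, total

demo_xy = np.array([[0.1, 0.1], [0.2, 0.2], [0.9, 0.9]])
demo_cats = np.array([0, 1, 0])
demo_palette = np.array([[255.0, 0.0, 0.0], [0.0, 0.0, 255.0]])  # red, blue

demo_counts = count_by_category(demo_xy, demo_cats, 2, (0, 1), (0, 1), 2, 2)
demo_rgb, demo_total = mix_colors(demo_counts, demo_palette)
```

A pixel shared equally by the red and blue categories comes out purple regardless of the order in which the points were supplied, whereas a conventional scatterplot would show whichever point happened to be drawn last.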

Because Bokeh only ever sees an image, providing legends and keys has to be done separately, though we are working to make this process more seamless.  For now, you can show a legend by adding a suitable collection of labeled points:

In [None]:
%%opts RGB [width=600]

gaussspread = dynspread(datashade(hv.NdOverlay(gaussians, kdims=['k']), aggregator=ds.count_cat('k'), cmap=Sets1to3))

color_key = list(enumerate(Sets1to3[0:num_ks]))
color_points = hv.NdOverlay({k: hv.Points([0,0], label=str(k)).opts(style=dict(color=v)) for k, v in color_key})

color_points * gaussspread

The `hv.NdOverlay` data structure merges all values along that dimension into the same image, (optionally) coloring each point or line to keep the values visibly different.  If you prefer to keep the values completely separate so that every image contains only one value along the `k` dimension, you can put the data into an `hv.HoloMap`, which lets you index to choose a value for that dimension (or for multiple such dimensions).  If you have not indexed into the dimension(s) to choose one location in the multidimensional space, HoloViews will automatically generate slider widgets to allow you to choose such a value interactively:

In [None]:
%%opts RGB [width=300] {+axiswise}

datashade(hv.HoloMap(gaussians, kdims=['k'])) + datashade(hv.HoloMap(lines, kdims=['k']))

You can thus very naturally explore even very large multidimensional datasets.  Note that the static exported version (e.g. on anaconda.org) will only show a single frame, rather than the entire set of frames visible with a live Python server.


## Working with time series

Although Datashader does not natively [support datetime(64)](https://github.com/bokeh/datashader/issues/270) types for its dimensions, we can convert to an integer representation and apply a custom formatter using HoloViews and Bokeh:

In [None]:
from bokeh.models import DatetimeTickFormatter
def apply_formatter(plot, element):
    plot.handles['xaxis'].formatter = DatetimeTickFormatter()
    
import pandas as pd
drange = pd.date_range(start="2014-01-01", end="2016-01-01", freq='1D') # or '1min'
dates = drange.values.astype('int64')/10**6 # Convert nanoseconds since the epoch to milliseconds, the unit Bokeh's datetime axis expects

In [None]:
%%opts RGB [finalize_hooks=[apply_formatter] width=800]

curve = hv.Curve((dates, time_series(N=len(dates), sigma = 1)))
datashade(curve, cmap=["blue"])
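
The numeric conversion used above can be checked in isolation. The sketch below uses NumPy's `datetime64` type directly in place of the pandas date range (an assumption made to keep the example dependency-free); `drange.values` holds the same nanosecond-resolution type.

```python
import numpy as np

# Three consecutive days, stored with nanosecond resolution as pandas does.
days = np.arange("2014-01-01", "2014-01-04", dtype="datetime64[D]")
nanoseconds = days.astype("datetime64[ns]").astype("int64")

# Divide by 10**6 to get milliseconds since the epoch (as floats).
milliseconds = nanoseconds / 10**6

# Round trip: the numeric axis values map back to the original dates.
recovered = milliseconds.astype("int64").astype("datetime64[ms]")
```

Consecutive daily values differ by 86,400,000 in these units, which is worth keeping in mind when setting axis ranges or window sizes on the converted data.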

HoloViews also supplies some operations that are useful in combination with Datashader timeseries.  For instance, you can compute a rolling mean of the results and then show a subset of outlier points, which will then support hover, selection, and other interactive Bokeh features:

In [None]:
%%opts Overlay [finalize_hooks=[apply_formatter] width=800] 
%%opts Scatter [tools=['hover', 'box_select']] (line_color="black" fill_color="red" size=10)
from holoviews.operation.timeseries import rolling, rolling_outlier_std
smoothed = rolling(curve, rolling_window=50)
outliers = rolling_outlier_std(curve, rolling_window=50, sigma=2)

datashade(curve, cmap=["blue"]) * dynspread(datashade(smoothed, cmap=["red"]),max_px=1) * outliers
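
The idea behind the rolling mean and the outlier detection can be sketched with NumPy. This is not the HoloViews implementation: the trailing window alignment, the sample standard deviation, and the names `rolling_stats` and `rolling_outliers` are assumptions made for the illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_stats(y, window):
    """Mean and sample standard deviation over each trailing window."""
    windows = sliding_window_view(y, window)      # shape (n - window + 1, window)
    return windows.mean(axis=1), windows.std(axis=1, ddof=1)

def rolling_outliers(y, window=50, sigma=2):
    """Indices whose value lies more than sigma deviations from the rolling mean."""
    mean, std = rolling_stats(y, window)
    tail = y[window - 1:]                         # last sample of each window
    flagged = np.abs(tail - mean) > sigma * std
    return np.flatnonzero(flagged) + window - 1   # shift back to indices into y

signal = 0.01 * np.sin(np.linspace(0, 20, 200))
signal[120] += 10.0                               # inject one obvious spike
outlier_indices = rolling_outliers(signal, window=50, sigma=2)
```

Only the flagged samples need to be sent to the browser as real glyphs, which is why they can support hover and selection while the full curve remains a Datashader image.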

In [None]:
rolling.function

Note that the above plot will look blocky in a static export (such as on anaconda.org), because the exported version is generated without taking the size of the actual plot (using default height and width for Datashader) into account, whereas the live notebook automatically regenerates the plot to match the visible area on the page.

# Hover info

As you can see, converting the data to an image using Datashader makes it feasible to work with even very large datasets interactively.  One unfortunate side effect is that the original datapoints and line segments can no longer be used to support "tooltips" or "hover" information directly; that data simply is not present at the browser level, and so the browser cannot unambiguously report information about any specific datapoint. Luckily, you can still provide hover information that reports properties of a subset of the data in a separate layer (as above), or you can provide information for a spatial region of the plot rather than for specific datapoints.  For instance, in some small rectangle you can provide statistics such as the mean, count, standard deviation, etc:

In [None]:
%%opts QuadMesh [tools=['hover']] (alpha=0 hover_alpha=0.2)
from holoviews.streams import RangeXY

fixed_hover = datashade(points, width=400, height=400) * \
    hv.QuadMesh(aggregate(points, width=10, height=10, dynamic=False))

dynamic_hover = datashade(points, width=400, height=400) * \
    hv.util.Dynamic(aggregate(points, width=10, height=10, streams=[RangeXY]), operation=hv.QuadMesh)

fixed_hover + dynamic_hover

In the above examples, the plot on the left provides hover information at a fixed spatial scale, while the one on the right reports on an area that scales with the zoom level so that arbitrarily small regions of data space can be examined, which is generally more useful.
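
The difference between the fixed and dynamic hover grids comes down to which ranges the coarse aggregate is computed over. The NumPy sketch below illustrates this under stated assumptions; `hover_info` and its arguments are hypothetical and do not correspond to a HoloViews or Bokeh API.

```python
import numpy as np

def hover_info(xy, x_range, y_range, cursor, nx=10, ny=10):
    """Report the point count of the coarse grid cell under the cursor."""
    counts, x_edges, y_edges = np.histogram2d(
        xy[:, 0], xy[:, 1], bins=(nx, ny), range=(x_range, y_range))
    # Locate the cell containing the cursor, clipping to the grid.
    ix = int(np.clip(np.searchsorted(x_edges, cursor[0], side="right") - 1, 0, nx - 1))
    iy = int(np.clip(np.searchsorted(y_edges, cursor[1], side="right") - 1, 0, ny - 1))
    return {"count": int(counts[ix, iy]),
            "x_bounds": (x_edges[ix], x_edges[ix + 1]),
            "y_bounds": (y_edges[iy], y_edges[iy + 1])}

sample = np.array([[0.052, 0.053], [0.068, 0.071], [0.95, 0.95]])

# Full view: one cell covers both nearby points.
wide = hover_info(sample, (0, 1), (0, 1), cursor=(0.01, 0.02))
# Zoomed view: the grid is recomputed for the new ranges, separating them.
zoomed = hover_info(sample, (0, 0.1), (0, 0.1), cursor=(0.055, 0.055))
```

With a fixed grid the cell boundaries would stay at the `wide` positions no matter how far you zoomed, so the two nearby points could never be distinguished by hovering.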

As you can see, HoloViews makes it just about as simple to work with Datashader-based plots as regular Bokeh plots (at least if you don't need hover or color keys!), letting you visualize data of any size in a browser using just a few lines of code. Because Datashader-based HoloViews plots are just a few extra steps added on to regular HoloViews plots, they support all of the same features as regular HoloViews objects, and can freely be laid out, overlaid, and nested together.  See [holoviews.org](http://holoviews.org) for examples and documentation for how to control the appearance of these plots and how to work with them in general.