## What is Datashader?
- Makes pictures of large datasets, fast!
- Preserves distribution and outliers in the visualization
- Uses Numba and Dask for scale
- Not exactly a plotting library but plays well with HoloViews and Bokeh
- http://datashader.org

## When would I want to use it?
- When you have a LOT of data to plot: tens of thousands of points or more.
- Focus is on ensuring the *distribution* is clear
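
A rough sketch of the underlying idea, assuming nothing about Datashader's internals: instead of drawing one marker per point, count how many points land in each pixel of a fixed grid, then color the counts. The helper `bin_points` and the sample data are hypothetical and use only the standard library.

```python
import random

def bin_points(xs, ys, x_range, y_range, width, height):
    """Count how many points fall into each cell of a width x height grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point lies outside the plotted range
        # Map data coordinates to pixel indices; clamp the upper edge.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid

rng = random.Random(0)
# A tight cluster of 10,000 points plus a single far-away outlier.
xs = [rng.gauss(0, 0.1) for _ in range(10_000)] + [4.5]
ys = [rng.gauss(0, 0.1) for _ in range(10_000)] + [4.5]

grid = bin_points(xs, ys, (-5, 5), (-5, 5), width=10, height=10)
total = sum(map(sum, grid))
```

Every point contributes to exactly one cell, so the dense cluster shows up as large counts while the lone outlier still occupies its own non-empty cell.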

In [1]:
import numpy as np
np.random.seed(42)

import holoviews as hv
hv.notebook_extension('matplotlib')

%opts Points [color_index=2] (cmap="bwr" edgecolors='k' s=50 alpha=1.0)
%opts Scatter3D [color_index=3 fig_size=250] (cmap='bwr' edgecolor='k' s=50 alpha=1.0)
%opts Image (cmap="gray_r") {+axiswise}
%opts RGB [bgcolor="black" show_grid=False]

import holoviews.plotting.mpl
holoviews.plotting.mpl.MPLPlot.fig_alpha = 0
holoviews.plotting.mpl.ElementPlot.bgcolor = 'white'

from holoviews.operation.datashader import datashade
import colorcet as cc


# Synthetic Example

- A bunch of points from 5 different Gaussian distributions
- 4 clusters of different sizes, and one big cluster that overlaps all of them
- One of the examples in http://datashader.org/user_guide/1_Plotting_Pitfalls.html

In [2]:
def gaussians(specs=[(1.5,0,1.0),(-1.5,0,1.0)],num=100):
    """
    A concatenated list of points taken from 2D Gaussian distributions.
    Each distribution is specified as a tuple (x,y,s), where x,y is the mean
    and s is the standard deviation.  Defaults to two horizontally
    offset unit-variance Gaussians.
    """
    np.random.seed(1)
    dists = [(np.random.normal(x,s,num), np.random.normal(y,s,num)) for x,y,s in specs]
    return np.hstack([d[0] for d in dists]), np.hstack([d[1] for d in dists])


dist = gaussians(specs=[(2,2,0.02), (2,-2,0.1), (-2,-2,0.5), (-2,2,1.0), (0,0,3)],num=10000)

In [3]:
hv.Points(dist) + hv.Points(dist)(style=dict(s=0.1)) + hv.Points(dist)(style=dict(s=0.01,alpha=0.05))


With traditional approaches to plotting, the picture you see depends heavily on the settings you pick (here, marker size and opacity).

## Big Data Plotting Pitfalls
- Overplotting (Image A)
- Oversaturation
- Undersampling
- Undersaturation
- Underutilized Range
- Nonuniform colormapping
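
To make the oversaturation pitfall concrete, here is a small worked calculation. It assumes standard "over" alpha compositing of identical markers stacked on the same pixel, in which case `n` markers with opacity `alpha` combine to `1 - (1 - alpha)**n`. The function name `stacked_opacity` is made up for this sketch.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n identical markers drawn on top of each other."""
    return 1.0 - (1.0 - alpha) ** n

# How opaque does a pixel get as more points pile up at alpha=0.05?
for n in (1, 10, 100, 1000):
    print(n, round(stacked_opacity(0.05, n), 3))
```

With `alpha=0.05`, a stack of 100 points is already more than 99% opaque, so pixels holding 100 points and pixels holding 1,000 points look nearly identical.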

Datashader is designed to avoid these pitfalls by default, with no per-plot tuning of marker size or opacity.
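
Part of why the defaults hold up: Datashader's `shade` step uses histogram equalization (`how='eq_hist'`) by default, which spreads the colormap over the counts that actually occur. The sketch below is a simplified, rank-based stand-in for that idea, not Datashader's implementation; `equalize` and the sample counts are made up for illustration.

```python
from bisect import bisect_right

def equalize(counts):
    """Map each nonzero count to the fraction of nonzero cells with a count <= it."""
    nonzero = sorted(c for c in counts if c > 0)
    return [bisect_right(nonzero, c) / len(nonzero) if c > 0 else 0.0
            for c in counts]

# Per-pixel counts with one extremely dense pixel.
counts = [0, 1, 2, 5, 10_000]

# Linear scaling: the dense pixel pushes everything else to nearly zero.
linear = [c / max(counts) for c in counts]

# Rank-based scaling: each distinct density level gets its own share of the range.
equalized = equalize(counts)
```

Under linear scaling the sparse pixels are almost indistinguishable from empty ones, while the equalized values are spread evenly between 0 and 1.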

In [4]:
%output size=200
datashade.cmap=cc.b_linear_bgy_10_95_c74[50:]
datashade(hv.Points(dist))



# Real Data
This dataset consists of over 10 million taxi trips in NYC.

In [5]:
data_dir = "../datashader-examples/data/"

In [9]:
import geoviews as gv
import pandas as pd

hv.extension('bokeh', width=95)

plot_width = 600
plot_height = int(plot_width // 1.2)

%opts RGB     [width=plot_width, height=plot_height, xaxis=None yaxis=None show_grid=False] 
%opts Shape (fill_alpha=0 line_width=1.5) [apply_ranges=False tools=['tap']] 
%opts Points [apply_ranges=False] WMTS (alpha=0.5)

datashade.cmap=cc.fire[50:]

In [10]:
df = pd.read_csv(
    data_dir + 'nyc_taxi.csv',
    usecols=['pickup_x', 'pickup_y', 'dropoff_x', 'dropoff_y',
             'passenger_count', 'tpep_pickup_datetime'],
)

taxi_points = hv.Points(df, kdims=['pickup_x', 'pickup_y'])
len(df)

10679307

In [11]:
shaded1 = datashade(taxi_points)

In [12]:
tiles = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg')
tiles * shaded1

# Zooming

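When the notebook is connected to a server (a live Python process), Datashader re-renders the points for the new viewport each time you zoom. In a static export the image stays fixed.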
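
Why re-rendering matters can be sketched without any plotting library: aggregating the same points again over a smaller viewport separates points that shared a single pixel in the full view. The `aggregate` helper, the viewport ranges, and the four sample points are hypothetical and only illustrate the idea.

```python
from collections import Counter

def aggregate(points, x_range, y_range, width=100, height=100):
    """Count points per pixel for one viewport; points outside it are skipped."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            col = min(int((x - x0) / (x1 - x0) * width), width - 1)
            row = min(int((y - y0) / (y1 - y0) * height), height - 1)
            counts[(row, col)] += 1
    return counts

# Four points that sit very close together.
points = [(0.01, 0.01), (0.04, 0.01), (0.01, 0.04), (0.04, 0.04)]

# Full view: each pixel covers 0.1 x 0.1 data units, so all four points collide.
full = aggregate(points, (-5, 5), (-5, 5))

# Zoomed view: re-aggregating over a tiny range resolves them into separate pixels.
zoomed = aggregate(points, (0, 0.05), (0, 0.05))
```

Here `full` has one non-empty pixel holding all four points, while `zoomed` has four non-empty pixels with one point each. Stretching the original image instead would only enlarge that single pixel.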

In [13]:
shaded2 = datashade(taxi_points)
tiles2 = gv.WMTS('https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg')
tiles2 * shaded2