# Plotting very large datasets

There are many approaches to plotting large datasets, but most of them are unsatisfactory. This notebook first shows some of the problems, then demonstrates how the Datashader library makes plotting large datasets practical.

In [None]:
import pandas as pd

from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, CustomJS, Range1d
from bokeh.io import push_notebook
from bokeh.tile_providers import STAMEN_TONER

from datashader.pipeline import DatashaderPipeline, Interpolate
from datashader.callbacks import IPythonKernelCallback

output_notebook()

## Load NYC Taxi data (takes a dozen seconds or so...)

In [None]:
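# Read only the columns used in this notebook to keep loading time and memory down
df = pd.read_csv('data/nyc_taxi.csv',
                 usecols=['pickup_x', 'pickup_y', 'dropoff_x', 'dropoff_y', 'passenger_count'])

# Initial viewport, in the same coordinate units as the pickup/dropoff columns
x_range = (-8240227.037, -8231283.905)
y_range = (4974203, 4979238)
df.tail()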
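The viewport values above look like Web Mercator meters (EPSG:3857), which is the projection map tiles are normally served in. That is an inference from the magnitudes and from the use of a tile layer, not something the data file states. Under that assumption, the following sketch converts between longitude/latitude and Web Mercator so you can check which area a viewport covers. The function names are hypothetical helpers, not part of Bokeh or Datashader.

```python
import math

# Radius of the sphere used by Web Mercator (EPSG:3857), in meters.
EARTH_RADIUS_M = 6378137.0

def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def web_mercator_to_lonlat(x, y):
    """Convert Web Mercator x/y in meters back to longitude/latitude in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon, lat

# Corners of the viewport used in this notebook (assumed to be Web Mercator)
print(web_mercator_to_lonlat(-8240227.037, 4974203))
print(web_mercator_to_lonlat(-8231283.905, 4979238))
```

If the assumption holds, both corners come out near 74°W, 40.7°N, which is consistent with a viewport over part of New York City.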

## Define a simple plot

In [None]:
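def base_plot():
    # Empty figure over the study area, with a map tile layer as background
    p = figure(tools='pan,wheel_zoom,box_zoom', plot_width=800, plot_height=500,
               x_range=x_range, y_range=y_range)
    p.add_tile(STAMEN_TONER)
    p.axis.visible = False
    return p

# Default marker styling for the scatterplots below
options = dict(line_color='black', fill_color='red')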

## A few points are fine using a Bokeh scatterplot

In [None]:
samples = df.sample(n=1000)
p = base_plot()
p.circle(x=samples['pickup_x'], y=samples['pickup_y'], **options)
show(p)

## When plotting more than a couple of thousand points, the study area is saturated

In [None]:
samples = df.sample(n=10000)
p = base_plot()
p.circle(x=samples['pickup_x'], y=samples['pickup_y'], **options)
show(p)

## Making the points tiny and partially transparent helps a bit

However, the size and alpha parameters are tricky to set. The right value for each depends on the zoom level and the number of points: the further you zoom in, the larger the size and the higher the alpha you need, so you end up editing the code each time.

Plotting also becomes very slow beyond about 10,000 points. With some browsers you can use Bokeh's WebGL support to render more points reasonably quickly, but there will always be a limit to the number of points that a web browser can handle well.

In [None]:
options = dict(line_color='red', fill_color='red', size=1, alpha=0.2)
samples = df.sample(n=10000)
p = base_plot()
p.circle(x=samples['pickup_x'], y=samples['pickup_y'], **options)
show(p)
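The difficulty of picking an alpha value can be made concrete with a little arithmetic. The sketch below assumes the browser uses standard "source-over" compositing and that all overlapping marks have the same color and alpha; under those assumptions, `n` stacked marks produce an opacity of `1 - (1 - alpha)**n`. The helper names are hypothetical and are not part of Bokeh.

```python
import math

def stacked_opacity(alpha, n_overlapping):
    """Opacity after drawing n identical marks with the given alpha on top of each other."""
    return 1.0 - (1.0 - alpha) ** n_overlapping

def overlaps_to_reach(alpha, target_opacity):
    """Smallest number of stacked marks whose combined opacity reaches the target."""
    return math.ceil(math.log(1.0 - target_opacity) / math.log(1.0 - alpha))

# How quickly alpha=0.2 saturates as more points land on the same pixel
for n in (1, 5, 10, 20, 50):
    print(n, round(stacked_opacity(0.2, n), 3))

print(overlaps_to_reach(0.2, 0.99))
```

With `alpha=0.2`, about ten overlapping points are already close to 90% opaque, so dense areas all look the same. A lower alpha postpones saturation but makes isolated points nearly invisible, which is why no single value works at every zoom level.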

## Using Datashader, you can easily aggregate points and conquer over-saturation

Datashader renders the entire dataset into a buffer in a separate Python process, and always provides a fixed-size image to the browser. The number of points is no longer a limiting factor, so you can use the entire dataset, and there is no need to set the alpha parameter. You can zoom in very far interactively and see all the points available in that viewport without ever changing the plot parameters. Each time you zoom or pan, a new image is rendered (which takes a few seconds for large datasets) and displayed on top of the other plot elements, giving you full access to all of your data.
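To see why the image size does not depend on the number of points, here is a minimal pure-Python sketch of the aggregation idea: every point is dropped into one cell of a fixed grid, and only the grid is turned into an image. This is an illustration of the concept only; `count_points` is a hypothetical helper, and Datashader's real implementation is different and far faster.

```python
import random

def count_points(xs, ys, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    x0, x1 = x_range
    y0, y1 = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Ignore points outside the current viewport
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue
        # Scale the coordinates to cell indices, clamping against rounding error
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid

# The grid has the same shape whether we aggregate a thousand points or many more
random.seed(0)
for n in (1_000, 100_000):
    xs = [random.random() for _ in range(n)]
    ys = [random.random() for _ in range(n)]
    grid = count_points(xs, ys, (0, 1), (0, 1), width=8, height=5)
    print(n, len(grid), len(grid[0]), sum(map(sum, grid)))
```

Zooming or panning simply means calling the aggregation again with new ranges, which is why each interaction produces a freshly rendered image.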

In [None]:
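import datashader as ds
p = base_plot()
# Aggregate every pickup location by counting the points that fall in each pixel
pipeline = DatashaderPipeline(df=df, glyph=ds.Point("pickup_x", "pickup_y"), agg=ds.count("passenger_count"))
# Re-run the pipeline in the Python kernel whenever the plot is zoomed or panned
IPythonKernelCallback(p, pipeline)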

In [None]:
p = base_plot()
pipeline = DatashaderPipeline(df=df, glyph=ds.Point("pickup_x", "pickup_y"), agg=ds.count("passenger_count"),
                              color_fn=Interpolate(low="lightblue",high="blue"))
IPythonKernelCallback(p, pipeline)

## Unpacking the steps involved in the Datashader pipeline

The cells above use a configurable interface that makes it simple to specify the individual stages of a standard scatterplot-like pipeline. If you prefer, you can implement the same process in your own custom code and do whatever you like; anything that produces an image is fine!

In [None]:
from datashader.callbacks import IPythonKernelCallback
import datashader as ds
from datashader import transfer_functions as tf

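def create_image(p, ranges, agg_fn=ds.count):
    # `ranges` describes the current viewport: data ranges plus height and width
    x_range, y_range = ranges['x_range'], ranges['y_range']
    h, w = ranges['h'], ranges['w']
    # Aggregate the points onto a canvas of the requested size
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'pickup_x', 'pickup_y', agg_fn('passenger_count'))
    # Map the aggregate values to colors on a log scale
    pix = tf.interpolate(agg, 'lightpink', 'red', how='log')
    # Draw the image so that it covers the current viewport
    dh = y_range[1] - y_range[0]
    dw = x_range[1] - x_range[0]
    p.image_rgba(image=[pix.img], x=x_range[0], y=y_range[0], dw=dw, dh=dh, dilate=False)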

p = base_plot()
IPythonKernelCallback(p, create_image, agg_fn=ds.count)
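The color-mapping stage of a custom pipeline can be sketched in the same spirit. The function below maps a grid of counts to RGBA pixels by interpolating between two colors on a log scale, leaving empty cells transparent. It is a rough illustration of what a log-scaled interpolation does, not Datashader's actual `tf.interpolate` code; `shade` and the `log1p` scaling are assumptions made for this example.

```python
import math

LIGHTPINK = (255, 182, 193)
RED = (255, 0, 0)

def shade(grid, low, high):
    """Map a grid of counts to RGBA tuples, interpolating low -> high on a log scale."""
    max_count = max(max(row) for row in grid)
    image = []
    for row in grid:
        pixels = []
        for count in row:
            if count == 0:
                # Empty cells stay fully transparent so the map tiles show through
                pixels.append((0, 0, 0, 0))
                continue
            # log1p compresses large counts, so sparse cells remain distinguishable
            t = math.log1p(count) / math.log1p(max_count)
            r, g, b = (round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
            pixels.append((r, g, b, 255))
        image.append(pixels)
    return image

counts = [[0, 1], [10, 1000]]
for row in shade(counts, LIGHTPINK, RED):
    print(row)
```

With a linear scale, cells holding 1 and 10 points would be almost indistinguishable next to a cell holding 1000. The log scale spreads them across the color range, which matters for data like taxi pickups where a few locations are far busier than the rest.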