In [1]:
import numpy as np
import pandas as pd

# Import Bokeh modules for interactive plotting
import bokeh.io
import bokeh.models
import bokeh.plotting

# DataShader
import datashader as ds
import datashader.bokeh_ext as ds_bokeh_ext

# Display graphics in this notebook
bokeh.io.output_notebook()

## Visualizing large data sets with DataShader

Manuel will give an auxiliary lesson toward the end of the course on DataShader. Here is a nice preview of what you will see there.

When you have a huge amount of data, rendering a plot can be intensive on system resources, since you might be plotting millions of points. You really only need to see the individual points if you zoom in; when looking at the plot from a bird's-eye view, you only need to see roughly what the data look like, and you can do so in a rasterized way. This is what DataShader affords us.
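To make "in a rasterized way" concrete, here is a minimal sketch in plain NumPy: bin the points into a grid with one cell per pixel and count the points in each cell. This only illustrates the idea and is not DataShader's implementation; the helper name `rasterize_counts` and the axis ranges are assumptions chosen for this example.

```python
import numpy as np

def rasterize_counts(x, y, x_range, y_range, width, height):
    """Count how many points fall in each cell of a height-by-width grid."""
    # y is passed first so that rows of the result follow the y axis
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts

# Hypothetical data, drawn the same way as in the cells below
rng = np.random.default_rng(3252)
x = rng.standard_t(5, size=100000)
y = rng.normal(0, 10, size=100000)

counts = rasterize_counts(x, y, (-10, 10), (-40, 40), width=400, height=350)
print(counts.shape)  # (350, 400)
```

However many points there are, the result has a fixed size set by the number of pixels, which is why the cost of drawing no longer grows with the size of the data set.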

To test it, let's draw many samples out of a 2D distribution. We'll just take the $x$ values out of a Student-t distribution and the $y$ values out of a Gaussian. We'll draw 100,000 samples of each and plot them.

In [2]:
# Generate random points and put in DataFrame
x, y = np.random.standard_t(5, size=100000), np.random.normal(0, 10, size=100000)
df = pd.DataFrame(data={'x': x, 'y': y})
source = bokeh.models.ColumnDataSource(df)

# Make plot
p = bokeh.plotting.figure(height=350, width=400, 
                          x_axis_label='x', y_axis_label='y')
circles = p.circle(source=source, x='x', y='y')
bokeh.io.show(p)

We immediately see a problem. We have no idea what the density of points is in the middle of the plot. We could try to use transparency to see if that helps.

In [3]:
circles.glyph.fill_alpha = 0.1
circles.glyph.line_alpha = 0.1
bokeh.io.show(p)

This helps a little, but we still cannot resolve the density of points in the middle of the plot. It is also hard to choose a value of `alpha` that gives good resolution of the transparency.

Here is where DataShader really helps. We can use it to shade the plot depending on the density of points.

Before doing that, we need to write a function to get the range of the axes of the plot.

In [4]:
def data_range(df, margin=0.02):
    """Return x and y axis ranges padded by a fraction `margin` of the data span."""
    x_min, x_max = df['x'].min(), df['x'].max()
    y_min, y_max = df['y'].min(), df['y'].max()
    x_pad = (x_max - x_min) * margin
    y_pad = (y_max - y_min) * margin
    return ([x_min - x_pad, x_max + x_pad],
            [y_min - y_pad, y_max + y_pad])

x_range, y_range = data_range(df)
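As a quick check of the padding arithmetic, here is a small standalone sketch. The helper `padded_range` is hypothetical and is not part of the notebook. It assumes the intent is to extend the axis by a fraction `margin` of the data span on each side, so that data spanning 0 to 10 with a 2% margin should give an axis running from about -0.2 to about 10.2.

```python
import numpy as np

def padded_range(values, margin=0.02):
    """Return [low, high] extended on both sides by margin * (max - min)."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    pad = (high - low) * margin
    return [low - pad, high + pad]

low, high = padded_range([0.0, 2.5, 10.0])
print(low, high)
```

The lower bound moves down and the upper bound moves up, so no point sits exactly on the edge of the plot.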

Now, we have to write a function that creates an image. This function is called every time the Bokeh plot is updated, for example when we zoom or pan.

In [5]:
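def create_image(x_range, y_range, w, h):
    # Canvas with one bin per pixel of the current plot view
    cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=int(h), 
                    plot_width=int(w))
    # Count the points of the global DataFrame `df` that fall in each bin
    agg = cvs.points(df, 'x', 'y')
    # Map the counts linearly onto the viridis colormap
    return ds.transfer_functions.shade(agg, cmap=ds.colors.viridis, how='linear')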
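The `how='linear'` argument means the color scale is proportional to the count in each bin. DataShader's `shade` also accepts other values of `how`, such as `'log'` and `'eq_hist'`, which can help when a few bins hold most of the points. The sketch below uses plain NumPy on a made-up array of counts to show why the choice matters. The function `normalize` is hypothetical and only approximates what a shading step does; it is not DataShader's code.

```python
import numpy as np

def normalize(counts, how="linear"):
    """Rescale counts to the interval [0, 1], linearly or logarithmically."""
    counts = np.asarray(counts, dtype=float)
    if how == "linear":
        values = counts
    elif how == "log":
        values = np.log1p(counts)  # log(1 + count), so empty bins map to 0
    else:
        raise ValueError(f"unknown scaling: {how}")
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

# Made-up counts spanning several orders of magnitude
counts = np.array([[0, 1, 10], [100, 1000, 10000]])
lin = normalize(counts, how="linear")
log = normalize(counts, how="log")

# A bin with 10 points is nearly invisible on the linear scale,
# but clearly distinguishable from an empty bin on the log scale.
print(lin[0, 2], log[0, 2])
```

With linear scaling, sparse regions sit close to the bottom of the colormap, which fits data like ours where the interest is in the dense center.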

Finally, we set up the figure using Bokeh. Then, DataShader paints an interactive image on the Bokeh figure. It gives us access to Bokeh's great zoom tools. Let's give it a shot.

In [6]:
p = bokeh.plotting.figure(height=350, width=400, x_range=x_range, y_range=y_range,
                          x_axis_label='x', y_axis_label='y')
ds_bokeh_ext.InteractiveImage(p, create_image)

This gives a truer picture of how the data look. They are dense around the center and fade out away from there. Now, 100,000 points is not a very big data set. DataShader really shines when we have huge data sets. Let's try 10,000,000 data points. Bokeh (or Matplotlib) would have a really tough time rendering this, and there would be really awful overlap of data points, making interpretation nearly impossible.

In [7]:
# Generate random points and put in DataFrame
x = np.random.standard_t(5, size=10000000)
y = np.random.normal(0, 10, size=10000000)
df = pd.DataFrame(data={'x': x, 'y': y})
x_range, y_range = data_range(df)

p = bokeh.plotting.figure(height=350, width=400, x_range=x_range, y_range=y_range,
                          x_axis_label='x', y_axis_label='y')
ds_bokeh_ext.InteractiveImage(p, create_image)

This is much clearer. Thank you, DataShader!