# Common plotting issues that get worse with large data

In [None]:
import numpy as np

# Requires holoviews, which can be installed with "conda install -c ioam holoviews"
import holoviews as hv
hv.notebook_extension()
hv.archive.auto(exporters=[hv.Store.renderers['matplotlib'].instance(size=50, fig='png', dpi=144)])

In [None]:
%opts Points [color_index=2] (cmap="bwr" edgecolors='k' s=50 alpha=1.0)
%opts Scatter3D [color_index=3 fig_size=250] (cmap='bwr' edgecolor='k' s=50 alpha=1.0)

## Overplotting

Let's consider plotting data that comes from two categories, shown below in blue and red as plots A and B.  When the two categories are overlaid, the result can be very different depending on which one is plotted first:

In [None]:
np.random.seed(42)
blue_coords = (np.random.normal( 0.5,size=300), np.random.normal( 0.5,size=300))
red_coords  = (np.random.normal(-0.5,size=300), np.random.normal(-0.5,size=300))

blues = hv.Points(blue_coords + (-1,), vdims=['c'])
reds  = hv.Points(red_coords  + ( 1,), vdims=['c'])

quartet = (blues + reds + reds*blues + blues*reds).cols(2)
quartet

Plots C and D show the same distribution of points, yet they give very different impressions of which category is more common, which can lead to incorrect decisions based on this data.  In fact, both categories are equally common here (300 points each).  The cause of this problem is simply occlusion:

In [None]:
hmap = hv.HoloMap({0:blues,0.000001:reds,1:blues,2:reds}, kdims=['level'])
hv.Table(hmap.table(), kdims=['x','y','level'], vdims=['c']).to.scatter3d()

Occlusion of data by other data is called **overplotting** or **overdrawing**, and it occurs whenever a datapoint is plotted on top of another datapoint, obscuring it.
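
The effect of draw order can also be reproduced without any plotting library. The following is a minimal sketch, assuming a coarse 20x20 grid in which each cell keeps only the label of the last point drawn into it; the helper names `topmost_category` and `overlay` are hypothetical and the samples are not identical to the ones plotted above:

```python
import numpy as np

BLUE, RED = 1, 2  # 0 marks an empty cell

def topmost_category(xs, ys, labels, bins=20, extent=(-4.0, 4.0)):
    """Rasterize points in order, keeping only the last label drawn in each cell."""
    lo, hi = extent
    ix = np.clip(((xs - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)
    iy = np.clip(((ys - lo) / (hi - lo) * bins).astype(int), 0, bins - 1)
    grid = np.zeros((bins, bins), dtype=int)
    for i, j, label in zip(ix, iy, labels):
        grid[j, i] = label  # a later point overwrites whatever was drawn before
    return grid

rng = np.random.RandomState(42)
blue_x, blue_y = rng.normal(0.5, size=(2, 300))
red_x, red_y = rng.normal(-0.5, size=(2, 300))
blue = (blue_x, blue_y, [BLUE] * 300)
red = (red_x, red_y, [RED] * 300)

def overlay(bottom, top):
    """Draw `bottom` first and `top` second on the same grid."""
    xs = np.concatenate([bottom[0], top[0]])
    ys = np.concatenate([bottom[1], top[1]])
    return topmost_category(xs, ys, bottom[2] + top[2])

blue_then_red = overlay(blue, red)  # red is drawn last
red_then_blue = overlay(red, blue)  # blue is drawn last

for name, grid in [("blue then red", blue_then_red), ("red then blue", red_then_blue)]:
    print(name, "-> blue cells:", (grid == BLUE).sum(), "red cells:", (grid == RED).sum())
```

Both grids cover exactly the same cells, yet the category drawn last claims every cell it touches, so the apparent balance between the two categories shifts with the order alone.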


## Saturation

You can reduce problems with overplotting by using transparency, via the alpha parameter provided in most plotting programs.  E.g. if alpha is 0.1, full brightness will be achieved only when 10 points overlap, reducing the effects of plot ordering but making it harder to see individual points:

In [None]:
%%opts Points (s=50 alpha=0.1)
quartet

Here C and D look fairly similar (as they should, since the distributions are identical), but there are still a few locations that have reached **saturation**, a problem that will occur when more than 10 points overlap.  With multiple categories as here, saturation leads to overplotting problems, because only the last 10 points plotted will affect the final color.  With a single category, saturation simply obscures differences in density.  For instance, 10, 20, and 2000 points overlapping will all look the same visually, for alpha=0.1.
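
The 10-point threshold corresponds to a simplified additive model in which each overlapping point contributes `alpha` to the total opacity. Many renderers instead use standard "over" blending, where n overlapping points give an opacity of 1 - (1 - alpha)^n, so saturation is approached more gradually. A minimal sketch comparing the two models (the function names are illustrative, not part of any plotting API):

```python
import numpy as np

def additive_opacity(n, alpha):
    """Simplified model: each overlapping point adds `alpha`, capped at 1."""
    return np.minimum(1.0, n * alpha)

def over_opacity(n, alpha):
    """'Over' blending: each new point covers a fraction `alpha` of what still shows through."""
    return 1.0 - (1.0 - alpha) ** n

counts = np.array([1, 5, 10, 20, 2000])
for n, a, o in zip(counts, additive_opacity(counts, 0.1), over_opacity(counts, 0.1)):
    print(f"{n:5d} overlaps: additive={a:.3f}  over={o:.3f}")
```

Under either model, large overlap counts end up at or very near full opacity, which is why very different densities can become visually indistinguishable.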

The biggest problem with using alpha to avoid saturation is that the correct value depends on the dataset -- e.g. if there are more points overlapping, a manually adjusted alpha setting that worked well for a previous dataset will systematically misrepresent the new dataset:

In [None]:
%%opts Points (s=50 alpha=0.1)

np.random.seed(42)
blue_coords = (np.random.normal( 0.5,size=900), np.random.normal( 0.5,size=900))
red_coords  = (np.random.normal(-0.5,size=900), np.random.normal(-0.5,size=900))

blues = hv.Points(blue_coords + (-1,), vdims=['c'])
reds  = hv.Points(red_coords  + ( 1,), vdims=['c'])

(blues + reds + reds*blues + blues*reds).cols(2)

Here C and D again look very different, yet represent the same distributions.  The correct alpha also depends on the dot size, because that affects the amount of overlap. With smaller dots, C and D look more similar, but the dots are now difficult to see because they are too transparent for this size:

In [None]:
%%opts Points (s=10 alpha=0.1 edgecolor=None)
quartet

As you can see, it is difficult to find settings for the dot size and alpha parameters that correctly reveal the data, even for relatively small and obvious datasets like these.  With larger datasets, it is difficult to detect that such problems are occurring, leading to false conclusions based on inappropriately visualized data.

## Binning problems

For large enough datasets, plotting every point as above is not always practical.  2D histograms visualized as heatmaps offer a practical way to visualize data compactly, and can also address issues like saturation directly (by effectively auto-ranging the alpha parameter based on the bin with the highest count).  Heatmaps can approximate a probability density function, averaging out noise or irrelevant variations to reveal an underlying distribution.
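
The auto-ranging idea can be sketched in plain NumPy. This is an illustration only, assuming a fixed 20x20 grid over [-4, 4] and the simplified model in which a bin holding 10 or more points is saturated at alpha=0.1; `bin_counts` is a hypothetical helper:

```python
import numpy as np

def bin_counts(n_points, rng, bins=20, extent=((-4, 4), (-4, 4))):
    """Histogram n_points standard-normal samples onto a fixed grid."""
    xs, ys = rng.normal(size=(2, n_points))
    counts, _, _ = np.histogram2d(xs, ys, bins=bins, range=extent)
    return counts

rng = np.random.RandomState(42)
for n in (300, 900):
    counts = bin_counts(n, rng)
    # Fixed setting: bins holding >= 10 points are treated as saturated at alpha=0.1.
    saturated = (counts >= 10).sum()
    # Auto-ranged: dividing by the largest count maps the densest bin to exactly 1.0.
    intensity = counts / counts.max()
    print(f"n={n}: densest bin={int(counts.max())}, "
          f"bins saturated at alpha=0.1: {saturated}, "
          f"auto-ranged max intensity: {intensity.max():.1f}")
```

Dividing by the largest count rescales each dataset to the same 0-to-1 range, so the densest bin is always the only level at full intensity, whereas the number of bins saturated under a fixed alpha changes with the dataset size.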

Here, let's consider a sum of two normal distributions slightly offset from each other:

In [None]:
%%opts Image.Blues (cmap="Blues") Image.Reds (cmap="Reds") Image (cmap="Blues") Image {+axiswise} Points (s=2)

num=600
np.random.seed(42)
offset=1.5
blue_coords = (np.random.normal( offset,size=num), np.random.normal( offset*0,size=num))
red_coords  = (np.random.normal(-offset,size=num), np.random.normal(-offset*0,size=num))
merged = (np.hstack((blue_coords[0],red_coords[0])),np.hstack((blue_coords[1],red_coords[1])))

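def heatmap(coords,bins=10):
    # 2D histogram of the (x, y) samples; hist[0] holds the counts, indexed [x_bin, y_bin]
    hist= np.histogram2d(coords[0], coords[1], bins=bins)
    # Reverse the y bins and transpose so that rows run from high y to low y, as in an image
    return hv.Image(hist[0][:,::-1].T)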

(hv.Points(merged) + [heatmap(merged,bins) for bins in [8,10,20,100,1000]]).cols(3)

In [None]:
hv.archive.export()

As you can see, the distribution looks very different depending on the number of bins used -- at some heatmap resolutions the two underlying groups can be distinguished, but at others the shape is unclear.  In plot F the data is barely visible, because the small pixel-sized bins make multiple counts per bin unlikely.  Nearly all the bins in F look empty, but in fact many are non-empty; they just happen to hold fewer points than the few bins where several datapoints overlap by chance.  This **undersaturation** problem, where values falsely appear to be zero because of plotting parameter settings, can hide data just as seriously as **oversaturation**, leading to incorrect conclusions about the shape of the data.
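
One way to quantify this effect is to count, for each bin setting, how many bins are occupied and how many of those sit at or below half of the maximum count, where a linear colormap would render them faintly. The sketch below is a rough illustration: the 0.5 cutoff and the `bin_summary` helper are assumptions, and the samples are drawn in a different order than in the cell above, so the exact numbers will not match the plots:

```python
import numpy as np

rng = np.random.RandomState(42)
num, offset = 600, 1.5
# Two normal distributions offset from each other along x, as in the plots above.
xs = np.hstack((rng.normal(offset, size=num), rng.normal(-offset, size=num)))
ys = rng.normal(0, size=2 * num)

def bin_summary(xs, ys, bins, faint=0.5):
    """Return (occupied bins, occupied-but-faint bins, max count) for a bins x bins histogram."""
    counts, _, _ = np.histogram2d(xs, ys, bins=bins)
    occupied = counts > 0
    intensity = counts / counts.max()  # linear mapping onto the colormap range
    faint_bins = occupied & (intensity <= faint)
    return int(occupied.sum()), int(faint_bins.sum()), int(counts.max())

for bins in [8, 10, 20, 100, 1000]:
    occupied, faint_bins, peak = bin_summary(xs, ys, bins)
    print(f"bins={bins:4d}: occupied={occupied}, faint={faint_bins}, max count={peak}")
```

As the bins shrink, the maximum count drops toward 1 or 2 while the number of occupied bins approaches the number of points, so the occupied bins make up only a tiny fraction of the image.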

Clearly, the bin size is now an important parameter that needs manual adjustment.  Yet for truly large data, there's not usually any plot like A available where all datapoints can be seen -- how do you know when your bin size is appropriate for a given dataset?  Usually the answer is "trial and error", which is awkward and time-consuming for large data.  The result is...

## For big data, you don't know when the viz is lying

I.e., visualization is supposed to help you explore and understand your data, but if your visualizations are systematically misrepresenting your data because of overplotting, saturation, undersaturation, and inappropriate binning, then you won't be able to discover the real qualities of your data and will be unable to make the right decisions.

The [`datashader`](https://github.com/bokeh/datashader) library has been designed to overcome many of the above problems, by automatically calculating appropriate parameters based on the data itself, and by allowing interactive visualizations of even truly large datasets with millions or billions of data points so that their structure can be revealed.