## Datashading a 2.7-billion-point OpenStreetMap database

The data comes from OpenStreetMap's (OSM) [bulk GPS point data](https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/). It was collected by OSM contributors' GPS devices and is stored as a CSV of `latitude,longitude` coordinates. The data was downloaded from the OSM website, extracted, converted to Web Mercator coordinates, and stored in a [castra](https://github.com/blaze/castra) file for faster disk access. To run this notebook, you would need to do the same, as the data files are too large to ship with `datashader`. Here we'll plot the points using [datashader](https://github.com/bokeh/datashader) and [dask](http://dask.pydata.org/en/latest/), after first loading them:

In [None]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize
import datashader as ds

In [None]:
df = dd.from_castra('data/osm.castra')
df.tail()

So we have ~2.7 billion points, in Web Mercator coordinates.
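
The conversion from `latitude,longitude` to Web Mercator mentioned above is not shown in this notebook. As a rough sketch, assuming the standard spherical Web Mercator formulas (EPSG:3857) and a hypothetical helper named `to_web_mercator`, it could look like this for a single point:

```python
import math

# Assumption: spherical Web Mercator with the WGS84 semi-major axis as the radius.
EARTH_RADIUS_M = 6378137.0

def to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Hypothetical usage: project one GPS fix (longitude first, then latitude)
x, y = to_web_mercator(-0.1276, 51.5072)
```

For a dataset of this size the same arithmetic would be applied column-wise rather than point by point; the function above is only meant to show the formula.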

### Aggregation

Create a canvas to provide pixel-shaped bins in which points can be aggregated.

In [None]:
bound = 20026376.39
bounds = dict(x_range=(-bound, bound), y_range=(int(-bound*0.4), int(bound*0.6)))
plot_width = 1000
plot_height = int(plot_width*0.5)

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)

with ProgressBar(), Profiler() as prof, ResourceProfiler(0.5) as rprof:
    agg = cvs.points(df, 'x', 'y', ds.count())
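
To make the idea of "pixel-shaped bins" concrete, here is a minimal pure-Python sketch of a count aggregation over a grid. It is not how datashader implements `cvs.points` (which is compiled and runs in parallel over dask partitions); the function name `count_points` and its edge handling are assumptions made for illustration.

```python
def count_points(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x < x1 and y0 <= y < y1):
            continue  # skip points that fall outside the canvas
        # Scale the coordinate to a fractional position, then truncate to a bin index
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid

# Hypothetical usage: four points on a 4x2 grid, one of them out of range
grid = count_points([(0.5, 0.5), (0.6, 0.4), (3.5, 1.5), (10, 10)],
                    x_range=(0, 4), y_range=(0, 2), width=4, height=2)
```

The result has the same role as `agg` above: a fixed-size array of counts whose size depends on the plot resolution, not on the number of input points.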

In [None]:
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, viridis, inferno
from IPython.core.display import HTML, display

background = "black"
export = partial(export_image, export_path="export", background=background)
cm = partial(colormap_select, reverse=(background=="black"))

### Transfer Function

Create an image out of the set of bins, mapping small (but nonzero) counts to light blue, the largest counts to dark blue, and interpolating according to a logarithmic function in between.

In [None]:
import datashader.transfer_functions as tf

In [None]:
tf.shade(agg, cmap=["lightcyan", "darkblue"], how="log")

There's some odd, low-count, nearly-uniform noise going on in the tropics. It's worth trying to figure out what that could be, but for now we can filter it out quickly from the aggregated data using the `where` method:

In [None]:
export(tf.shade(agg.where(agg > 20), cmap=["lightcyan", "darkblue"], how="log"), "OSM_blue", background=None)

In [None]:
export(tf.shade(agg.where(agg > 20), cmap=cm(["lightcyan", "darkblue"]), how="log"), "OSM_black_blue")

In [None]:
export(tf.shade(agg.where(agg > 20), cmap=Hot, how="log"), "OSM_hot")

The result is a decent map of world population, with Europe apparently having an especially large number of OpenStreetMap contributors. The long great-circle paths are presumably flight or boat trips, from devices that logged their GPS coordinates more than 20 times within the span of a single pixel in this plot.
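
The logarithmic color mapping requested with `how="log"` in this section can be sketched in a few lines. This is an illustration under assumptions, not datashader's actual code: the helper `log_shade` is hypothetical, it uses `log1p` scaling and linear RGB interpolation, and it returns `None` for empty bins to stand in for a transparent pixel.

```python
import math

# CSS color values for the two endpoints used in the plots above
LIGHTCYAN = (224, 255, 255)
DARKBLUE = (0, 0, 139)

def log_shade(count, min_count, max_count, low=LIGHTCYAN, high=DARKBLUE):
    """Map a bin count to an RGB color, interpolating on a logarithmic scale."""
    if count <= 0:
        return None  # empty bins stay transparent
    lo, hi = math.log1p(min_count), math.log1p(max_count)
    # Fraction of the way from the smallest to the largest count, in log space
    t = 0.0 if hi == lo else (math.log1p(count) - lo) / (hi - lo)
    return tuple(round(a + t * (b - a)) for a, b in zip(low, high))

# Hypothetical usage: colors for a few counts between 1 and 1000
colors = [log_shade(c, 1, 1000) for c in (1, 30, 1000)]
```

Because the interpolation happens in log space, a bin with 30 points already lands a substantial way along the color ramp, which is what keeps sparsely covered regions visible next to very dense ones.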

### Performance Profile

In [None]:
from bokeh.io import output_notebook
from bokeh.resources import CDN
output_notebook(CDN, hide_banner=True)

In [None]:
visualize([prof, rprof])

Performance Notes:
- On a 16 GB machine, most of the time is spent reading the data from disk (the yellow rectangles).
- Reading time includes not just disk I/O but also decompressing chunks of data.
- The disk reads don't release the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) (GIL), so CPU usage (see second chart above) drops to only one core during those periods.
- During the aggregation steps (the green rectangles), CPU usage on a four-core machine spikes to around 400%, as the aggregation function releases the GIL. For in-memory data, the entire computation can happen in parallel and will run much faster.
- The data takes up 54 GB of memory when uncompressed, but peak physical memory use is only around 3.5 GB. This shows that the approach can handle larger-than-memory datasets.

### Interactive plotting

If you have enough RAM to hold the whole dataset (or are very patient), you can uncomment the `InteractiveImage` line below and run the cell to build an interactive plot where you can select a region for zooming. Without enough RAM, computation has to be done out of core, and it could take several CPU-intensive minutes to process a series of pan and zoom events before the final result is displayed.



In [None]:
from bokeh.plotting import figure, output_notebook
from bokeh.io import push_notebook
from datashader.bokeh_ext import InteractiveImage
from datashader import transfer_functions as tf

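def create_image(x_range, y_range, w, h):
    # Size the canvas to the requested plot dimensions so each pan/zoom is re-rendered at full resolution
    cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_width=w, plot_height=h)
    agg = cvs.points(df, 'x', 'y', ds.count())
    return tf.shade(agg.where(agg > 20), cmap=["lightcyan", "darkblue"], how="log")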

p = figure(tools='pan,wheel_zoom,box_zoom,reset', plot_width=plot_width, plot_height=plot_height, **bounds)

p.axis.visible = False
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
#InteractiveImage(p, create_image)