## A demonstration of using Dask to visualize a density map of the full source catalog from the Gaia DR2 release

### Visualization imports

In [None]:
import holoviews as hv
import holoviews.operation.datashader as hd

### Set up the cluster

Dask provides utilities for building clusters to run distributed compute jobs. Here we use a `KubeCluster`, a type of cluster that uses the Kubernetes API to spawn a separate pod for each worker. The description of each worker is read from a YAML file, `/etc/dask/dask_worker.yml` in the cell below.

In [None]:
from dask.distributed import Client, wait
from dask import dataframe as dd
from dask_kubernetes import KubeCluster
import os
cluster = KubeCluster.from_yaml('/etc/dask/dask_worker.yml')

If you're running with debug logging on, the dask `distributed` loggers are much too chatty. The cell below raises any of them that are below `INFO` up to `INFO`.

In [None]:
import logging
for comp in ['','.core', '.comm', '.client', '.scheduler']:
    dcl = logging.getLogger('distributed{}'.format(comp))
    if dcl.getEffectiveLevel() < logging.INFO:
        dcl.debug("Setting loglevel for {} to INFO.".format(dcl.name))
        dcl.setLevel(logging.INFO)
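The check above uses `getEffectiveLevel` rather than the logger's own level because child loggers inherit from their parents. The following is a minimal standard-library sketch of that behavior; the logger names `demo_app` and `demo_app.core` and the helper `quiet_below_info` are hypothetical and stand in for the `distributed` loggers.

```python
import logging


def quiet_below_info(names):
    """Raise each named logger to INFO if it is currently more verbose.

    Returns the names whose level was actually changed.
    """
    raised = []
    for name in names:
        logger = logging.getLogger(name)
        if logger.getEffectiveLevel() < logging.INFO:
            logger.setLevel(logging.INFO)
            raised.append(name)
    return raised


# Hypothetical logger hierarchy: only the parent has an explicit level.
parent = logging.getLogger("demo_app")
child = logging.getLogger("demo_app.core")
parent.setLevel(logging.DEBUG)

# The child has no level of its own, so it reports the parent's level.
inherited = child.getEffectiveLevel()

# Handling the parent first means the child then inherits INFO
# and does not need to be changed itself.
raised = quiet_below_info(["demo_app", "demo_app.core"])
```

Listing the parent logger first, as the notebook cell does with `''` ahead of `'.core'`, means children without an explicit level pick up the new setting by inheritance.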

The workers have the same profile as the instance selected to run the notebook. For this example, we suggest a `large` size with 4 cores and 12GB of RAM. A total of 60 cores works well for this demo, so we ask for 15 workers (15 × 4 cores).
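The worker count follows from simple arithmetic. As a small sketch (the helper `workers_needed` is hypothetical, not part of dask), the number of workers for a target core count can be computed like this:

```python
import math


def workers_needed(target_cores: int, cores_per_worker: int) -> int:
    """Smallest number of workers whose combined cores reach the target."""
    if target_cores <= 0 or cores_per_worker <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(target_cores / cores_per_worker)


# Values from the text: 60 cores in total, 4 cores and 12 GB per worker.
n_workers = workers_needed(target_cores=60, cores_per_worker=4)
total_ram_gb = n_workers * 12
print(n_workers, total_ram_gb)
```

Rounding up means a target that is not a multiple of the per-worker core count still gets enough workers.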

In [None]:
_ = cluster.scale_up(15)
client = Client(cluster)
client

The link above takes you to the status dashboard, which shows summary information about the cluster.

In [None]:
client

Now read the metadata for the parquet files we'll use for the analysis below.  This does not read all the data, but only the metadata for the files in this data set.

> Note that either of the methods in the cell will work when running at the LDF; however, direct POSIX filesystem access may not work if running in a different environment (e.g. GKE).

In [None]:
try:
    # Use the big dataset if it exists
    df = dd.read_parquet('/project/shared/data/gaia_dr2/gaia_source.parquet', columns=['l', 'b'], index=[], engine='fastparquet')
    #if reading from the cloud storage bucket, use the following instead
    #df = dd.read_parquet('gs://jupyterlabdemo-gaia-dr2/gaia_source.parquet', columns=['l', 'b'], index=[], engine='fastparquet')
except FileNotFoundError:
    # These data should exist everywhere
    df = dd.read_parquet('/project/shared/data/rsp_check_data/parquet/gaia_source.parquet', columns=['l', 'b'], index=[], engine='fastparquet')

We have asked for only a subset of the source catalog: the galactic longitude (`l`) and latitude (`b`). We tell the dask client to cache these columns in memory on the worker nodes with the `persist` method to speed up later computations.
> Note that the `persist` method is asynchronous: it returns immediately, so following cells that interact with the dataframe may not execute until the persist is finished. Follow along with progress by visiting the link in the output of cell 4.

In [None]:
df = client.persist(df)

We can now do things like count the number of rows.  This is still a parallel computation and you can look at the summary of the execution by going to the link in the output of cell 4.

In [None]:
%%time
len(df)

We are going to produce an aggregate map over cells on the sky. The default is to simply count the number of entries in each spatial cell. To set up the color map, we ask that the smallest numbers be shown in light blue and the largest in dark blue with a linear ramp.

In [None]:
hd.shade.cmap=["lightblue", "darkblue"]
hv.extension("bokeh", "matplotlib")
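To illustrate what a linear ramp between two colors means, here is a minimal pure-Python sketch. It covers only the color interpolation: how raw counts are normalized to a position between 0 and 1 is decided by the shading operation and is not modeled here. The helper `ramp_color` is hypothetical, and the RGB triples are the standard CSS values for the two named colors.

```python
LIGHT_BLUE = (173, 216, 230)  # CSS "lightblue"
DARK_BLUE = (0, 0, 139)       # CSS "darkblue"


def ramp_color(t, start=LIGHT_BLUE, end=DARK_BLUE):
    """Linearly interpolate between two RGB colors.

    `t` is the position along the ramp: 0.0 gives `start`, 1.0 gives `end`.
    Values outside that interval are clamped.
    """
    t = min(1.0, max(0.0, t))
    return tuple(round(s + t * (e - s)) for s, e in zip(start, end))


lowest = ramp_color(0.0)    # color for the emptiest cells
highest = ramp_color(1.0)   # color for the densest cells
halfway = ramp_color(0.5)   # a color in between
```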

Set up the points to be aggregated.  Defaults are fine here.

In [None]:
points = hv.Points(df)

Now do the aggregation and display. The `datashade` operation bins each of our two spatial coordinates and counts the entries in each bin. This effectively produces a density map of the sky for all 1.7 billion entries in the Gaia DR2 source catalog. Since we are using bokeh as the rendering library, the standard pan and zoom widgets are available.

In [None]:
%%time
%%opts RGB [width=800, height=400]
hd.datashade(points)
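To make the aggregation step concrete, the sketch below counts points per cell on a small grid in plain Python. It is an illustration of the idea only, not how `datashade` is implemented; the helper `density_grid`, the sample coordinates, and the 4 × 2 grid are all made up for the example.

```python
def density_grid(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall in each cell of a width x height grid.

    Returns a list of rows; grid[row][col] holds the count for that cell.
    Points outside the given ranges are ignored.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Scale each coordinate to a cell index; clamp the upper edge
        # so that x == x_max or y == y_max lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Made-up (l, b) pairs in degrees, binned over the whole sky.
sample = [(10.0, -5.0), (12.0, -4.0), (200.0, 45.0), (359.9, 89.9)]
grid = density_grid(sample, (0.0, 360.0), (-90.0, 90.0), width=4, height=2)
```

Each cell's count is what the color map then turns into a pixel color, which is why the rendered image reads as a density map.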

As of at least `w_2019_41` the following cell emits a warning that the client cannot connect to the scheduler.  This is expected since the close method stops the scheduler as well.

In [None]:
# close down the cluster
cluster.close()