## A demonstration of using Dask to visualize radial velocity data from the Gaia DR2 release

### Visualization imports

In [None]:
import holoviews as hv
import holoviews.operation.datashader as hd
import datashader as ds

### Set up the cluster

Dask provides utilities for building clusters for distributed computation. Here we use a `LocalCluster`, which runs inside our container and shares its resources with the other processes there, such as the notebook itself.

In [None]:
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

This example assumes you've asked for a `large` instance to work in, which provides 4 CPUs and 12 GB of RAM. Since we have 4 CPUs to work with, we ask for 4 workers in our cluster, each with a single thread.

In [None]:
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
client

The dashboard link in the output above opens the status dashboard, which shows summary information about the cluster.

Now read the metadata for the parquet files we'll use for the analysis below.  This does not read all the data, but only the metadata for the 100 files in this data set.

> Note that either of the methods in the cell will work when running at the LDF. Direct POSIX filesystem access, however, may not work in a different environment (e.g. GKE).

In [None]:
try:
    # Use the big dataset if it exists
    df = dd.read_parquet('/project/shared/data/gaia_dr2/gaia_source_with_rv.parquet', columns=['l', 'b', 'radial_velocity'], engine='fastparquet')
    # If reading from the cloud storage bucket, use the following instead
    #df = dd.read_parquet('gs://jupyterlabdemo-gaia-dr2/gaia_source_with_rv.parquet', columns=['l', 'b', 'radial_velocity'], engine='fastparquet')
except FileNotFoundError:
    # These data should exist everywhere
    df = dd.read_parquet('/project/shared/data/rsp_check_data/parquet/gaia_source_with_rv.parquet', columns=['l', 'b', 'radial_velocity'], engine='fastparquet')

The data are stored with galactic longitude running from 0 to 360 degrees. This puts the galactic center on the edge of the plot, so we define a rotation that makes the longitude run from -180 to 180 degrees instead, placing the galactic center in the middle.

In [None]:
def rot_l(x):
    """Shift the galactic longitude of one row from (180, 360] to (-180, 0]."""
    l = x['l']
    if l > 180.:
        x['l'] = l - 360.
    return x

Now apply the above function row by row.  This is a lazy operation and will not be performed until we ask for it.

In [None]:
df = df.apply(rot_l, axis=1, meta=(('l', 'float64'), ('b', 'float64'), ('radial_velocity','float64')))
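The rotation applied above comes down to one rule per longitude value. As a minimal sketch, the plain-Python function below applies that rule to a handful of sample values so the mapping is easy to check by hand. The name `wrap_longitude` and the sample values are hypothetical and are not part of the notebook's pipeline.

```python
def wrap_longitude(l_deg):
    """Map a galactic longitude in [0, 360] degrees onto (-180, 180]."""
    # Same rule as rot_l: only values above 180 are shifted down by 360.
    return l_deg - 360.0 if l_deg > 180.0 else l_deg


# Hypothetical sample longitudes, in degrees.
samples = [0.0, 90.0, 180.0, 180.5, 270.0, 359.0]
wrapped = [wrap_longitude(l) for l in samples]
print(wrapped)  # [0.0, 90.0, 180.0, -179.5, -90.0, -1.0]
```

On a pandas or Dask series the same rule can probably be written column-wise, for example with `Series.where`, which would avoid calling a Python function once per row. That variant is not shown or timed here.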

The `compute` method runs all of the computations we've asked for and returns an in-memory `DataFrame`. Specifically, it reads just the `l`, `b`, and `radial_velocity` columns and applies the rotation to the galactic longitude row by row. To watch the progress, follow the link printed in the output of cell 3.

In [None]:
df = df.compute()

Now we can do things like count the number of rows.  This is no longer parallel, but is still fast because the data are in memory.

In [None]:
%%time
len(df)

We are going to produce a map of the `radial_velocity` column aggregated over cells on the sky. In setting up the color map, we set the most negative values to blue and the most positive values to red.

In [None]:
hd.shade.cmap = ["darkblue", "red"]
hv.extension("bokeh", "matplotlib")

Set up the points to be aggregated: `kdims` are the spatial coordinates and `vdims` are the values that map to the color scale.

In [None]:
points = hv.Points(df, kdims=['l', 'b'], vdims=['radial_velocity'])

Now do the aggregation and display. We specify that each cell in the aggregated view takes the mean of the `radial_velocity` values that fall in it. Since we are using bokeh as the rendering library, the standard pan and zoom widgets are available.

In [None]:
%%time
%%opts RGB [width=1000, height=500]
hd.datashade(points, aggregator=ds.mean('radial_velocity'))
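To make the idea of a per-cell mean concrete, here is a small pure-Python sketch that bins a few points onto a square grid and averages the values in each cell. It is only an illustration of the concept and not how datashader is implemented. The function name `mean_per_cell`, the 10-degree cell size, and the toy points are all hypothetical.

```python
from collections import defaultdict


def mean_per_cell(points, cell_size):
    """Average the values of all points that fall in the same square cell.

    points: iterable of (l, b, value) tuples, coordinates in degrees.
    cell_size: width of a square cell in degrees.
    Returns a dict mapping (column, row) cell indices to the mean value.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for l, b, value in points:
        # Floor division gives consistent cell indices for negative coordinates too.
        cell = (int(l // cell_size), int(b // cell_size))
        sums[cell] += value
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical (l, b, radial_velocity) triples.
toy_points = [
    (10.0, 5.0, -20.0),
    (12.0, 7.0, 40.0),
    (-30.0, 2.0, 15.0),
]
grid = mean_per_cell(toy_points, cell_size=10.0)
print(grid)  # {(1, 0): 10.0, (-3, 0): 15.0}
```

The first two points share a cell, so their velocities of -20 and 40 average to 10. The third point sits alone in its cell and keeps its own value.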

In the resulting map, some portions of the sky show stars that are, on average, coming toward us (blue), and others show stars that are receding (red). Because the disk of the galaxy rotates differentially, material outside the Sun's orbit revolves more slowly and material interior to it revolves more quickly. This manifests as the two embedded dipoles in the figure.
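The alternating pattern can be sketched with the first-order Oort approximation for nearby stars on circular orbits, $v_r \approx A\, d\, \sin(2l)\cos^2 b$. This approximation is an assumption introduced here for illustration and is not part of the analysis above. It ignores the Sun's own peculiar motion, and the value used for the Oort constant $A$ is only a representative number.

```python
import math

# Illustrative value for Oort's constant A in km/s/kpc (assumption, not fitted here).
A_OORT = 15.0


def radial_velocity_oort(l_deg, b_deg, distance_kpc):
    """First-order line-of-sight velocity (km/s) from differential rotation."""
    l = math.radians(l_deg)
    b = math.radians(b_deg)
    return A_OORT * distance_kpc * math.sin(2.0 * l) * math.cos(b) ** 2


# Sample one direction per longitude quadrant, in the plane, 1 kpc away.
for l_deg in (-135, -45, 45, 135):
    v = radial_velocity_oort(l_deg, 0.0, 1.0)
    print(l_deg, "receding" if v > 0 else "approaching")
```

In this approximation the sign flips every 90 degrees of longitude, so neighboring quadrants alternate between receding and approaching stars.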

In [None]:
# close down the cluster
cluster.close()