# Scaling, Performance, and Memory

In this notebook we will work with a multi-machine cluster operating in the cloud.  We will tune the performance of a workflow that enables interactive visualization, and learn how to measure and improve performance in a distributed context.  We'll make some pretty images too.


## Request Dask Cluster

There are many services to create Dask clusters in the cloud.  Today we'll use Coiled.

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=10,
    package_sync=True,
)

from dask.distributed import Client
client = Client(cluster)

client

## Large Scale GIS Visualization

For our application we'll visualize the taxi drop-off locations in the classic NYC Taxi dataset.

This data is available to us in Parquet format on S3.

### Data

In [None]:
# Read in one year of NYC Taxi data

import dask.dataframe as dd

df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009"
)
df.head()

In [None]:
len(df)

<img src="images/nyc-taxi-scatter.png" align="right" width="40%">

### Plotting large scale data is hard

Let's say we wanted a map of where taxis dropped off passengers.  In principle we'd want something like the following:

```python
# Sample on the cluster, then bring the small result back as pandas to plot it
df.sample(frac=0.001).compute().plot(
    x="dropoff_longitude",
    y="dropoff_latitude",
    kind="scatter",
)
```

Even at 0.1% downsampling this is still just a big blob of blue.

We can do better.

### Datashader for large scale visualization

[Datashader](https://datashader.org/) is a Python library designed to visualize large datasets.  It also happens to build on Dask.  Instead of drawing one marker per row, it aggregates the points into a fixed grid of pixels and then colors each pixel by its aggregate, so the image stays readable however many rows go in.

We won't go into how Datashader works in this tutorial (there are excellent resources online).  For us it's just a tool to show that we're processing our data quickly.

In [None]:
import datashader
from datashader import transfer_functions as tf
from datashader.colors import Greys9, Hot
import holoviews as hv
import numpy as np
from holoviews import opts
from holoviews.element.tiles import StamenTonerBackgroundRetina

hv.extension("bokeh")

Greys9_r = list(reversed(Greys9))[:-2]

In [None]:
%%time

# Define plotting parameters
plot_width = int(750)
plot_height = int(plot_width // 1.2)
x_range, y_range = (-74.1, -73.7), (40.6, 40.9)
plot_options = hv.Options(width=plot_width, height=plot_height, xaxis=None, yaxis=None)

# Plot
canvas = datashader.Canvas(
    plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range
)
agg = canvas.points(
    df, "dropoff_longitude", "dropoff_latitude", datashader.count("passenger_count")
)
datashader.transfer_functions.shade(agg, cmap=["white", "darkblue"], how="linear")

Datashader plots can be finely customized. Let's try again with some different settings:

In [None]:
datashader.transfer_functions.shade(agg, cmap=Hot, how="eq_hist")

That looks nicer.
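Why does `how="eq_hist"` reveal more than `how="linear"`? Per-pixel counts tend to be heavily skewed, so a linear map spends nearly the whole color range on a few very busy pixels. Histogram equalization instead assigns intensity by where a count sits in the overall distribution. The sketch below illustrates that idea with a simple rank-based scaling on invented counts; it is a simplified NumPy illustration, not Datashader's actual implementation.

```python
import numpy as np

# Invented per-pixel counts: most pixels are quiet, one is extremely busy.
counts = np.array([1, 2, 3, 5, 8, 10_000], dtype=float)

# Linear scaling: intensity is proportional to the raw count.
linear = (counts - counts.min()) / (counts.max() - counts.min())

# Rank-based scaling (the idea behind histogram equalization):
# intensity depends on a count's position in sorted order, not its size.
ranks = counts.argsort().argsort()
equalized = ranks / (len(counts) - 1)

print(np.round(linear, 4))
print(equalized)
```

With linear scaling every pixel except the busiest one ends up close to zero intensity, while the rank-based version spreads the same pixels evenly across the range.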

## Cleaning Up The Data
Something's off about this plot...where is the signature NYC street grid?

In [None]:
df.dropoff_longitude.head()

In [None]:
df.dropoff_longitude.max().compute(), df.dropoff_longitude.min().compute()

Looks like there are some rows with bad location data in the mix. 

Let's filter those out and retry.

In [None]:
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) &
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9)
]
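To see what this bounding-box mask keeps and drops, here is the same comparison applied to a tiny pandas frame. The rows are invented for illustration (two plausible NYC points, one at `(0, 0)`, and one with a wildly wrong longitude); only the column names and bounds come from this notebook.

```python
import pandas as pd

# Toy stand-in for the taxi data; all values are invented.
toy = pd.DataFrame(
    {
        "dropoff_longitude": [-73.98, 0.0, -74.00, -3547.9],
        "dropoff_latitude": [40.75, 0.0, 40.72, 40.70],
    }
)

# Same bounds as the plotting range used above
x_range, y_range = (-74.1, -73.7), (40.6, 40.9)

in_bounds = (
    (toy.dropoff_longitude > x_range[0]) & (toy.dropoff_longitude < x_range[1])
    & (toy.dropoff_latitude > y_range[0]) & (toy.dropoff_latitude < y_range[1])
)
clean = toy.loc[in_bounds]
print(clean)
```

A row has to pass all four comparisons to survive, so the last row is dropped even though its latitude looks fine.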

In [None]:
canvas = datashader.Canvas()
agg = canvas.points(
    df, "dropoff_longitude", "dropoff_latitude", datashader.count("passenger_count")
)
datashader.transfer_functions.shade(agg, cmap=Hot, how="eq_hist")

## Let's Speed This Up
That works...technically. But it's painfully slow to render. How can we speed this up?

One of the time-consuming tasks here is fetching the data from S3. We can `.persist()` the dataframe into our cluster memory before we render the interactive plot. That should speed things up a bit.

In [None]:
df = df.persist()

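Before plotting again, let's also keep only the three columns the plot needs and repartition the data into evenly sized chunks: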

In [None]:
df = df[["dropoff_longitude", "dropoff_latitude", "passenger_count"]].repartition(partition_size="256 MiB").persist()

In [None]:
%%time

agg = datashader.Canvas().points(
    df, "dropoff_longitude", "dropoff_latitude", datashader.count("passenger_count")
)
datashader.transfer_functions.shade(agg, cmap=Hot, how="eq_hist")

## More Data!
You've shown this to your colleagues and they're impressed. But your manager sees an opportunity and is curious how much more they can do with this! They point you to another bucket and ask you to visualize the data that's in there: 5 years' worth of NYC taxi data, totalling over 200 GB in memory.

In [None]:
df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009-2013/"
)

You've just learned some neat tricks, so you decide to persist the data to cluster memory before plotting this one.

In [None]:
df = df.persist()

In [None]:
# Select only NYC data points
df = df.loc[
    (df.dropoff_longitude > -74.1) & (df.dropoff_longitude < -73.7) &
    (df.dropoff_latitude > 40.6) & (df.dropoff_latitude < 40.9)
]

In [None]:
%%time

agg = datashader.Canvas().points(
    df, "dropoff_longitude", "dropoff_latitude", datashader.count("passenger_count")
)
datashader.transfer_functions.shade(agg, cmap=Hot, how="eq_hist")

In [None]:
client.restart()  # oof.  that was bad.
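One plausible reason that run went badly is memory pressure: persisting every column of the full dataset asks each worker to hold a large share of it. A back-of-the-envelope check, assuming the data spreads evenly across workers, might look like the sketch below. The dataset size and worker count come from this notebook; the per-worker memory is a made-up placeholder, not a property of this cluster.

```python
# Figures taken from this notebook
dataset_gb = 200   # "over 200 GB in memory"
n_workers = 10     # n_workers=10 in coiled.Cluster(...)

# ASSUMPTION: hypothetical per-worker memory. Check the Dask dashboard or
# client.scheduler_info() for the real figure on your cluster.
worker_memory_gb = 16

# Share of the dataset each worker would hold if it were spread evenly
share_gb = dataset_gb / n_workers
fits = share_gb <= worker_memory_gb

print(f"each worker would need ~{share_gb:.0f} GB of its {worker_memory_gb} GB")
print("fits in memory:", fits)
```

Even when the share does fit, workers need spare memory for intermediate results, so it pays to shrink the data before persisting it.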

In [None]:
df.dtypes

## Reduce dataset size in memory

In [None]:
df = df[["dropoff_latitude", "dropoff_longitude", "passenger_count"]]

In [None]:
dtypes = {
    "vendor_id": "string[pyarrow]",
    "passenger_count": "int16",
    "trip_distance": "float32",
    "pickup_latitude": "float32",
    "pickup_longitude": "float32",
    "payment_type": "string[pyarrow]",
    "fare_amount": "float32",
    "surcharge": "float32",
    "tip_amount": "float32",
    "tolls_amount": "float32",
    "total_amount": "float32",
}
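# Only cast columns that are still present; astype rejects keys that are not columns
dtypes = {col: dtype for col, dtype in dtypes.items() if col in df.columns}
df = df.astype(dtypes).persist()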
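Narrower dtypes save memory because each value occupies fewer bytes: a `float64` takes 8 bytes and a `float32` takes 4, while an `int64` takes 8 and an `int16` takes 2. The sketch below measures the effect on a small synthetic pandas frame. The column names are borrowed from the `dtypes` mapping above, but the values and row count are invented.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)

# Synthetic data using pandas' default 64-bit dtypes
toy = pd.DataFrame(
    {
        "fare_amount": rng.uniform(2.5, 80.0, n),                 # float64
        "passenger_count": rng.integers(1, 7, n, dtype="int64"),  # int64
    }
)

before = toy.memory_usage(index=False).sum()

# Downcast to the narrower dtypes used in the notebook
small = toy.astype({"fare_amount": "float32", "passenger_count": "int16"})
after = small.memory_usage(index=False).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The trade-off is precision: `float32` keeps roughly seven significant decimal digits, which is plenty for fares but worth checking before downcasting other columns.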

In [None]:
client.restart()

In [None]:
df.memory_usage(deep=True).sum().compute()

In [None]:
_ / 1e9

In [None]:
df = df[["dropoff_longitude", "dropoff_latitude", "passenger_count"]]

In [None]:
df = df.persist().repartition(partition_size="256 MiB").persist()

In [None]:
len(df)

In [None]:
datashader.Canvas.points?

In [None]:
%%time

agg = datashader.Canvas().points(
    source=df, 
    x="dropoff_longitude", 
    y="dropoff_latitude", 
    agg=datashader.count("passenger_count")
)

tf.shade(agg, cmap=Hot, how="eq_hist")